What are popular ways of doing dimensionality reduction in NLP tasks? Do you think this is even important?

A common representation is bag-of-words, which is very high-dimensional given a large vocabulary. Commonly used ways of reducing dimensionality in NLP:

  1. TF-IDF: Term frequency – inverse document frequency (link to relevant article). Reweights raw counts so that terms frequent in a document but rare across the collection carry the most weight; see the sketch after this list.
  2. Word2Vec / GloVe: These have become very popular. Word2Vec embeddings are learned by a shallow neural network trained to predict a word from its context (CBOW) or the context from a word (skip-gram); GloVe fits embeddings to global word co-occurrence statistics (** give references). A document embedding can be obtained by averaging the embeddings of all words in the document, as sketched after this list.
  3. ELMo embeddings: Deep contextual embeddings. ELMo can give a different embedding for each context a word occurs in.
  4. LSI: Latent Semantic Indexing, based on the Singular Value Decomposition (SVD); see the truncated-SVD sketch after this list.
  5. Topic modeling: Techniques such as Latent Dirichlet Allocation (LDA) that find latent topics in a document collection and represent each document as a reduced-dimensional vector of topic strengths (see the LDA sketch after this list).
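
A minimal TF-IDF sketch using scikit-learn; the toy corpus is illustrative, not from the original answer. Note that TF-IDF by itself keeps one dimension per vocabulary term; actual shrinking of the space comes from pruning options such as `min_df` or `max_features`:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative)
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]

# Each document becomes a sparse vector of TF-IDF weights,
# one dimension per vocabulary term.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

print(X.shape)                                 # (3, vocab_size)
print(vectorizer.get_feature_names_out()[:5])  # first few vocabulary terms
```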
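
A sketch of the averaging approach to document embeddings, assuming gensim 4.x (in gensim 3.x the `vector_size` parameter was called `size`). The sentences and hyperparameters are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus (illustrative)
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# Train skip-gram Word2Vec; each word maps to a 50-d vector.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)

def doc_embedding(tokens, model):
    """Average the embeddings of all in-vocabulary words in the document."""
    vectors = [model.wv[w] for w in tokens if w in model.wv]
    if not vectors:
        return np.zeros(model.vector_size)
    return np.mean(vectors, axis=0)

print(doc_embedding(sentences[0], model).shape)  # (50,)
```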
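
LSI is essentially a truncated SVD of the term-document (or TF-IDF) matrix. A minimal sketch with scikit-learn's `TruncatedSVD`, on the same illustrative corpus:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]
X = TfidfVectorizer().fit_transform(corpus)

# Keep the top-2 singular directions; each document becomes a dense 2-d vector.
svd = TruncatedSVD(n_components=2, random_state=0)
X_lsi = svd.fit_transform(X)
print(X_lsi.shape)  # (3, 2)
```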
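
A topic-modeling sketch using scikit-learn's `LatentDirichletAllocation`; LDA expects raw term counts, so it is fit on a `CountVectorizer` matrix. The corpus and number of topics are illustrative:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are popular pets",
]
counts = CountVectorizer().fit_transform(corpus)

# Each document is represented as a 2-d vector of topic strengths.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)  # (3, 2)
print(doc_topics[0])     # topic-strength vector for the first document
```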
