This is typically called topic modeling. Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. For instance, two statements – one about meals and one about food – can probably be characterized by the same topic even though they do not necessarily use the same vocabulary. Topic models typically leverage word co-occurrences to discover semantic themes in documents. The goal is usually to express a document as a combination of a small number of topics.
Common techniques for topic modeling are the following:
- Latent Semantic Analysis:
LSA involves constructing a term-document matrix whose rows correspond to terms and whose columns correspond to documents. Each cell holds the number of times a word occurs in a document, so this sparse matrix is a co-occurrence matrix describing the occurrences of terms in documents. After constructing the term-document matrix, LSA finds a low-rank approximation of it through SVD (Singular Value Decomposition). Each document can then be expressed as a low-dimensional vector, where each dimension of the vector corresponds to a topic.
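As a minimal sketch of LSA using scikit-learn (the tiny corpus below is made up for illustration; note that scikit-learn orients the matrix as documents × terms, the transpose of the term-document convention above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Toy corpus: two documents about food, two about sports.
docs = [
    "the meal was delicious and the food was fresh",
    "we enjoyed the food and the tasty meal",
    "the game ended with a dramatic goal",
    "the team scored a goal and won the game",
]

# Build the (sparse) document-term matrix.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Low-rank approximation via truncated SVD: each of the 2 components
# plays the role of a topic, and each document becomes a 2-dim vector.
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vectors = svd.fit_transform(X)  # shape: (n_documents, 2)
print(doc_vectors.shape)
```

With two clearly separated themes in the corpus, the two food documents tend to load on one component and the two sports documents on the other.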
- Latent Dirichlet Allocation: This is a probabilistic model that models a document as a multinomial mixture of topics and each topic as a multinomial mixture of words. Each of these multinomials has a Dirichlet prior. The goal is to learn these multinomial proportions using probabilistic inference techniques based on the observed data, which is the words/content in the documents. The outcome of LDA is one multinomial per document, which is a low-dimensional representation of the document, and one multinomial per topic. One can visualize a topic as the set of words whose weight is high in the corresponding multinomial.
- Non-negative matrix factorization: NMF is similar to SVD in that it factorizes the term-document matrix. However, while SVD allows both positive and negative values, NMF factorizes the term-document matrix A (n×m) into two matrices W (n×k) and H (k×m), giving k topics, where both W and H are constrained to be non-negative. Each row of W is a low-dimensional representation of a term, while each column of H is a low-dimensional representation of a document.
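A minimal NMF sketch with scikit-learn (toy corpus as before). Note the orientation: scikit-learn factorizes the documents × terms matrix, so its W maps documents to topics and its H maps topics to terms, the transpose of the term-document convention above; the non-negativity of both factors is the defining constraint either way:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "the meal was delicious and the food was fresh",
    "we enjoyed the food and the tasty meal",
    "the game ended with a dramatic goal",
    "the team scored a goal and won the game",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)  # documents x terms

# Factorize X ≈ W @ H with k = 2 topics, both factors non-negative.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)   # documents x topics
H = nmf.components_        # topics x terms
print(W.shape, H.shape)
```

Because all entries of W and H are non-negative, each document is an additive (not subtractive) combination of topics, which tends to make NMF topics easier to interpret than SVD components.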
In all the above techniques, the number of topics K is assumed to be known in advance.