Explain Latent Dirichlet Allocation. Where is it typically used?

Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a multinomial mixture of topics and each topic as a multinomial distribution over words. Each of these multinomials has a Dirichlet prior. The goal is to learn the multinomial proportions by probabilistic inference from the observed data, namely the words in the documents.

Let there be $M$ documents, $V$ words in the vocabulary, and let $K$ be the number of topics we want to find. LDA can be defined by the following generative process:

→ For each topic $k$, $\phi_k = (\phi_{k1}, \ldots, \phi_{kV})$ is a topic-specific multinomial distribution over the $V$ words in the vocabulary, where $\phi_{kv}$ is the weight (probability) given to word $v$ in topic $k$. Each $\phi_k$ is drawn from a Dirichlet distribution: $\phi_k \sim \mathrm{Dir}(\beta)$.

→ For each document $j$, $\theta_j = (\theta_{j1}, \ldots, \theta_{jK})$ is a document-specific multinomial distribution over the $K$ topics, where $\theta_{jk}$ is the weight (probability) of topic $k$ in document $j$. Each $\theta_j$ is drawn from a Dirichlet distribution: $\theta_j \sim \mathrm{Dir}(\alpha)$.

→ For each word position $i$ in document $j$, a topic $z_{ji} \sim \theta_j$ is drawn from the document-specific multinomial $\theta_j$. Then, given that topic, a word $w_{ji} \sim \phi_{z_{ji}}$ is drawn from the corresponding topic-specific multinomial.
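The three steps above can be sketched as a short NumPy sampler. This is only an illustration under assumed settings: the function name `generate_corpus`, the fixed document length `doc_len`, the symmetric scalar values for `alpha` and `beta`, and the toy sizes are all hypothetical choices, not part of the model definition.

```python
import numpy as np

def generate_corpus(M=5, V=20, K=3, doc_len=50, alpha=0.5, beta=0.1, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)
    # phi[k] ~ Dir(beta): one distribution over the V words per topic.
    phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)
    # theta[j] ~ Dir(alpha): one distribution over the K topics per document.
    theta = rng.dirichlet(np.full(K, alpha), size=M)   # shape (M, K)
    docs, topics = [], []
    for j in range(M):
        # z[i] ~ theta[j]: a topic for every word position in document j.
        z = rng.choice(K, size=doc_len, p=theta[j])
        # w[i] ~ phi[z[i]]: a word drawn from the topic assigned to position i.
        w = np.array([rng.choice(V, p=phi[k]) for k in z])
        docs.append(w)
        topics.append(z)
    return phi, theta, docs, topics

phi, theta, docs, topics = generate_corpus()
```

Each entry of `docs` is an array of word ids in the range $0$ to $V-1$. In a real corpus only these words are observed, while `phi`, `theta` and `topics` are hidden.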

Learning then amounts to inference: given the observed data, i.e., the words in the documents, we estimate the $\theta_j$s and $\phi_k$s. Exact posterior inference is intractable, so approximate techniques such as Gibbs sampling or variational inference are typically used.
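As one possible illustration of the inference step, here is a minimal sketch of a collapsed Gibbs sampler, a common variant in which $\theta_j$ and $\phi_k$ are integrated out and only the topic assignments $z_{ji}$ are resampled. The function name `lda_gibbs`, the symmetric scalar priors, the iteration count and the toy corpus are assumptions made for the example; the point estimates at the end come from the counts of the final sample only.

```python
import numpy as np

def lda_gibbs(docs, V, K, alpha=0.5, beta=0.1, n_iter=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    M = len(docs)
    n_jk = np.zeros((M, K))   # words in document j assigned to topic k
    n_kv = np.zeros((K, V))   # times word v is assigned to topic k
    n_k = np.zeros(K)         # total number of words assigned to topic k
    # Random initial topic assignment for every word position.
    z = [rng.integers(K, size=len(d)) for d in docs]
    for j, d in enumerate(docs):
        for w, k in zip(d, z[j]):
            n_jk[j, k] += 1
            n_kv[k, w] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for j, d in enumerate(docs):
            for i, w in enumerate(d):
                k = z[j][i]
                # Remove the current assignment from the counts.
                n_jk[j, k] -= 1
                n_kv[k, w] -= 1
                n_k[k] -= 1
                # Conditional of z_ji given all other assignments and the words.
                p = (n_jk[j] + alpha) * (n_kv[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and restore the counts.
                z[j][i] = k
                n_jk[j, k] += 1
                n_kv[k, w] += 1
                n_k[k] += 1
    # Smoothed point estimates from the counts of the final sample.
    theta = (n_jk + alpha) / (n_jk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kv + beta) / (n_kv + beta).sum(axis=1, keepdims=True)
    return theta, phi

# Hypothetical toy corpus: 3 documents over a 6-word vocabulary.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 1, 3, 4]]
theta, phi = lda_gibbs(docs, V=6, K=2)
```

Here `theta[j]` estimates the topic proportions of document $j$ and `phi[k]` estimates the word distribution of topic $k$. A practical implementation would also discard burn-in iterations and average over several samples.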

The outcome of LDA is one multinomial $\theta_j$ for each document, which serves as a low-dimensional ($K$-dimensional) representation of the document, and one multinomial $\phi_k$ for each topic. A topic can be visualized by listing the words that have the highest weight in its multinomial.
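To make the visualization step concrete, the sketch below lists the highest-weight words of each topic. The helper `top_words`, the four-word vocabulary and the numbers in `phi` are made up for illustration and are not the output of a fitted model.

```python
import numpy as np

def top_words(phi, vocab, n=2):
    """Return the n highest-weight words for each topic (each row of phi)."""
    # Sort word indices by weight within each topic, highest first.
    order = np.argsort(phi, axis=1)[:, ::-1][:, :n]
    return [[vocab[v] for v in row] for row in order]

# Hypothetical topic-word matrix with K = 2 topics and V = 4 words.
vocab = ["goal", "match", "vote", "party"]
phi = np.array([
    [0.50, 0.30, 0.10, 0.10],
    [0.05, 0.05, 0.60, 0.30],
])
summary = top_words(phi, vocab, n=2)
print(summary)  # [['goal', 'match'], ['vote', 'party']]
```

Reading the top words, one might label the first topic as sports and the second as politics; LDA itself provides only the weights, not the labels.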
