Latent Dirichlet Allocation

What is Topic Modeling?

Before the advent of LLMs and transformers, the world of natural language processing (NLP) looked vastly different. Limited by both the available models and insufficient computing power, this branch of AI tackled problems that were significantly different from the chatbots we see today.

One task that was of remarkable interest was topic modeling. The main goal in this context is to discover hidden thematic structures in a collection of documents. We effectively seek to identify the underlying topics or themes within a set of textual data without requiring prior knowledge of what those topics might be. A model capable of performing this task is of great help in performing exploratory data analysis and summarizing large bodies of text.

Perhaps the most significant breakthrough in topic modeling occurred in 2003 when Blei, Ng, and Jordan introduced Latent Dirichlet Allocation (LDA). This is a generative probabilistic model that helps discover topics within a corpus of text by representing each document as a mixture of topics. LDA’s strength comes from its flexibility, a trait that arises from the hierarchical structure of the model which defines several parameters at corpus, document, and topic levels. Let us dive into how LDA works, how to learn its parameters from data, and how to apply it in a practical setting.

Bag of Words

Let us first tackle the idea of representing text within an NLP model. The bag-of-words (BoW) model is a simple and foundational technique that addresses this issue. It’s a way to represent text data as a collection of individual words or terms, ignoring the order and structure of the words in the document. The name “Bag of Words” implies that it treats a document as if it were just a bag or a set of words, without considering their sequence or grammar. 

To implement this idea, one would first tokenize the text (break it down into individual words or chunks of words), remove irrelevant words (such as prepositions and other stop words), and create a vocabulary, meaning a set of all possible words available to you. Think of it as the Cambridge English dictionary. Following these steps, one would count the occurrences of every word and store them in a table.
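As a minimal sketch of these steps (assuming scikit-learn is available; any tokenizer and stop-word list would do), the snippet below builds a small bag-of-words table:

```python
# A minimal bag-of-words sketch using scikit-learn's CountVectorizer.
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "A cat and a dog fly through a galaxy of stars.",
    "A cat wants to fly towards a mouse.",
]

# stop_words="english" drops common function words such as prepositions and articles.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # documents x vocabulary occurrence table
```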

Documents, Topics, Words

We have been mentioning terms like documents, topics and words, but we have not yet addressed what they mean. Words are pretty intuitive as they refer exactly to what you would expect. Documents are collections of words, which could be sentences, paragraphs, or entire articles. The more interesting idea revolves around topics. Since we do not want to define a set of categories a priori, we need a flexible way of modeling topics which would allow us to learn some representations in an unsupervised way.

In LDA, topics are modeled as a mixture of words, or, to be precise, a probability distribution over the vocabulary. This means that we do not associate certain words with a given topic, but consider the likelihood of observing a word given a selected topic. This idea is very powerful and comes in very handy when writing the mathematical model, as it lets us leverage Bayesian statistics. Taking a step further, we can view documents as mixtures of topics. This means that we do not assign a single topic to each document, but acknowledge the possibility that we might observe several topics.

Let us have a look at the following example, where we have a document containing two sentences: "A cat and a dog fly through a galaxy of stars. A cat wants to fly towards a mouse." Based on this text, we could assume there are two topics, distributed with 60% and 40% probability, respectively. Then, given the topic, we could define a probability distribution over the words. An important remark is that a word can have a positive probability in multiple conditional distributions. This means that we do not classify words by topic, but attempt to approximate the frequency of a word given the underlying topic.
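To make the example concrete, here is a tiny numeric sketch; the topic labels and probabilities below are made up for illustration, not learned from data:

```python
# Toy illustration of "documents are mixtures of topics, topics are mixtures of words".
# All probabilities are invented for the example.
doc_topic = {"animals": 0.6, "space": 0.4}          # p(topic | document)

topic_word = {
    "animals": {"cat": 0.4, "dog": 0.3, "mouse": 0.2, "fly": 0.1},
    "space":   {"galaxy": 0.5, "stars": 0.4, "fly": 0.1},
}                                                    # p(word | topic)

# p(word | document) = sum over topics of p(topic | document) * p(word | topic)
def word_prob(word):
    return sum(p_t * topic_word[t].get(word, 0.0) for t, p_t in doc_topic.items())

print(word_prob("fly"))   # "fly" has mass under both topics: 0.6*0.1 + 0.4*0.1 = 0.10
print(word_prob("cat"))   # only the "animals" topic contributes: 0.6*0.4 = 0.24
```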

Stepping back, we have a corpus comprised of several documents, each with its own distribution of topics, with each topic having a distribution of words. For this reason, we can say that Latent Dirichlet Allocation is a hierarchical Bayesian model, since it combines several layers.

General Framework

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

LDA assumes the following generative process for each document wₘ in a corpus D (a small simulation of this process is sketched after the list):

For each document m:

  1. Choose the document length N ~ Poisson(ξ)

  2. Choose the topic distribution θₘ ~ Dirichlet(α)

  3. For each of the N words wₘₙ:

     a) Choose a topic zₘₙ ~ Multinomial(θₘ)

     b) Choose a word wₘₙ from p(wₘₙ | zₘₙ, β), a multinomial probability conditioned on the topic zₘₙ.
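This generative story can be simulated directly. The sketch below uses NumPy with arbitrary illustrative choices of k, V, α, β, and the Poisson rate ξ:

```python
# A minimal simulation of LDA's generative process (illustrative values, not a fitted model).
import numpy as np

rng = np.random.default_rng(0)

k, V = 2, 6                                # number of topics and vocabulary size (assumed)
alpha = np.full(k, 0.5)                    # symmetric Dirichlet hyperparameter
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word matrix, rows sum to 1

N = rng.poisson(8)                         # 1. document length N ~ Poisson(xi), xi = 8 here
theta = rng.dirichlet(alpha)               # 2. topic mixture theta_m ~ Dirichlet(alpha)

words = []
for _ in range(N):
    z = rng.choice(k, p=theta)             # 3a. topic z_mn ~ Multinomial(theta_m)
    w = rng.choice(V, p=beta[z])           # 3b. word w_mn ~ Multinomial(beta[z_mn])
    words.append(w)

print(theta, words)
```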

Several simplifying assumptions are made in this basic model, some of which the original paper relaxes in later sections. First, the dimensionality k of the Dirichlet distribution (and thus the dimensionality of the topic variable z) is assumed known and fixed. Second, the word probabilities are parameterized by a k×V matrix β, where entry βᵢⱼ is the probability of the j-th vocabulary word under the i-th topic; for now we treat β as a fixed quantity that is to be estimated. Finally, the Poisson assumption on the document length is not critical to anything that follows, and more realistic document length distributions can be used as needed. Note that N is independent of all the other data-generating variables (θ and z); it is thus an ancillary variable, and we will generally ignore its randomness in the subsequent development.

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

p(θₘ, zₘ, wₘ | α, β) = p(θₘ | α) ∏ₙ p(zₘₙ | θₘ) p(wₘₙ | zₘₙ, β),

where the product runs over the N word positions of document m.
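For a toy assignment of θₘ, zₘ, and wₘ, this factorization can be evaluated term by term; the parameter values in the sketch below are purely illustrative:

```python
# Evaluating the joint density p(theta, z, w | alpha, beta) for toy values.
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([0.5, 0.5])               # illustrative hyperparameters
beta = np.array([[0.6, 0.3, 0.1],          # p(word | topic 0)
                 [0.1, 0.2, 0.7]])         # p(word | topic 1)

theta = np.array([0.7, 0.3])               # topic mixture theta_m for one document
z = [0, 0, 1]                              # topic assignment z_mn for each word position
w = [0, 1, 2]                              # observed word index w_mn for each position

log_joint = dirichlet.logpdf(theta, alpha) # log p(theta_m | alpha)
for z_n, w_n in zip(z, w):
    log_joint += np.log(theta[z_n])        # log p(z_mn | theta_m)
    log_joint += np.log(beta[z_n, w_n])    # log p(w_mn | z_mn, beta)

print(log_joint)                           # log of the joint density above
```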

Understanding the importance of Dirichlet distributions in LDA requires a bit of Bayesian reasoning: the Dirichlet is the conjugate prior of the multinomial, so when a Dirichlet prior is combined with a multinomial likelihood, the posterior obtained through Bayes' theorem is again Dirichlet distributed and can be written in closed form. Nevertheless, the posterior over LDA's latent variables is intractable for exact inference, and several approximate inference algorithms exist, such as the Laplace approximation, Markov chain Monte Carlo (e.g., Gibbs sampling), and the paper authors' choice of a variational Bayes algorithm, which will be discussed in the following section.
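Conjugacy is easy to verify numerically in the simple one-distribution case: the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts. A minimal sketch with made-up counts:

```python
# Dirichlet-multinomial conjugacy: posterior = Dirichlet(alpha + counts).
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])   # Dirichlet prior over 3 categories (illustrative)
counts = np.array([5, 2, 1])        # observed multinomial word counts

posterior_alpha = alpha + counts    # conjugate update, still a Dirichlet
posterior_mean = posterior_alpha / posterior_alpha.sum()

print(posterior_alpha)              # [6. 3. 2.]
print(posterior_mean)               # expected word probabilities under the posterior
```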

Variational Inference

Variational inference is a mathematical technique used to estimate these hidden topics. Instead of directly calculating the exact posterior over topics, we work with a simpler, approximate distribution:

  • Approximations: We make educated guesses about what the topics might be and how they are distributed in the documents. These guesses are like rough sketches of the real topics.
  • Adjustments: We don’t get the guesses right on the first try. So, we adjust and refine them iteratively. Think of it as gradually improving our sketches to get closer to the real picture.
  • Probabilities: At each step, we calculate the likelihood (probability) that our approximations are correct. This helps us measure how confident we are in our guesses.
  • Convergence: We keep refining our approximations until they stabilize, meaning they stop changing significantly. At this point, we believe our approximations are a good representation of the actual topics.
  • Topic Discovery: Once we have these refined approximations, we can use them to discover what topics are present in the documents and how much of each topic is in each document.

Variational inference uses Jensen’s inequality to obtain an adjustable lower bound on the log likelihood; therefore, variational parameters are chosen by an optimization procedure aimed at finding the tightest possible lower bound. The optimizing values of the variational parameters are found by minimizing the Kullback-Leibler (KL) divergence between the variational distribution and the true posterior distribution. We recommend this blog post about KL divergence to get acquainted with this dissimilarity measure in Bayesian inference.
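For discrete distributions, the KL divergence has a simple closed form; a minimal sketch with illustrative distributions q (the variational approximation) and p (the target posterior):

```python
# KL divergence D(q || p) between two discrete distributions.
import numpy as np

q = np.array([0.7, 0.2, 0.1])   # variational approximation (illustrative)
p = np.array([0.5, 0.3, 0.2])   # target posterior (illustrative)

# D(q || p) = sum_i q_i * log(q_i / p_i); non-negative and zero only when q == p
kl = np.sum(q * np.log(q / p))
print(kl)
```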

We then obtain (approximate) empirical Bayes estimates by maximizing this lower bound with respect to the model parameters. We have thus far considered the log likelihood for a single document. Given our assumption of exchangeability for the documents, the overall log likelihood of a corpus  D ={w1, …, wM } is the sum of the log likelihoods for individual documents; moreover, the overall variational lower bound is the sum of the individual variational bounds.

In the variational E-step, discussed before, we maximize the lower bound for the posterior distribution with respect to the variational parameters. In the M-step, we maximize the bound with respect to the model parameters α and β. The overall procedure can thus be viewed as coordinate ascent on the lower bound L. By isolating terms and adding Lagrange multipliers, we can get the exact value of β by setting the derivative to zero. The estimate of α has no closed form, so we obtain it using iterative methods like the Newton-Raphson algorithm.
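In practice you rarely code the E- and M-steps by hand. For example, scikit-learn's LatentDirichletAllocation fits LDA with a variational Bayes procedure; the toy corpus and settings below are illustrative:

```python
# Fitting LDA on a toy corpus with scikit-learn's variational Bayes implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "A cat and a dog fly through a galaxy of stars.",
    "A cat wants to fly towards a mouse.",
    "The galaxy is full of stars and planets.",
    "The dog chases the cat and the mouse.",
]

counts = CountVectorizer(stop_words="english").fit_transform(corpus)

lda = LatentDirichletAllocation(n_components=2, random_state=0)  # k = 2 topics
doc_topic = lda.fit_transform(counts)    # per-document topic mixtures (theta)
topic_word = lda.components_             # unnormalized topic-word weights (related to beta)

print(doc_topic.round(2))
print(topic_word.round(2))
```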

Conclusion

Latent Dirichlet Allocation remains one of the most popular topic modeling algorithms. Since its introduction 20 years ago, this approach has been widely used in empirical applications to summarize large bodies of text. Now you can ask ChatGPT to summarize a document for you, but we believe LDA can still come in very handy. Understanding the model and its use cases can be a great addition to your ML toolbox.

Authors: Kassym Mukhanbetiyar, Matei Gabriel Cosa