Background

I’ve frequently found that resources covering Latent Dirichlet Allocation (LDA) are difficult to understand. They are often excessively technical and amount to a cobbled-together collection of calculus derivations. The resources that are easier to follow tend to be high-level overviews, so you get the idea of what LDA accomplishes but never fully grasp how it works.

This book focuses on LDA for inference via Gibbs sampling and attempts to provide a comprehensive overview of both the high-level and granular components of LDA. To aid in understanding both LDA and Gibbs sampling, all probability distributions used in LDA will be reviewed, along with a variety of approaches to parameter estimation. Following the introduction of these components, LDA will be presented as a generative model. This will lay the groundwork for understanding how LDA can be used to infer the topics in a corpus.

I have tried my best to relay an explanation of LDA that fills in the gaps and answers the questions that are often left out of publications. The book contains many code examples, but I do not shy away from walking through mathematical derivations. Where applicable, I state the mathematical properties used in the derivations so that the reader doesn’t have to ‘take my word for it’, but can instead go from A to B on their own. You will find code examples written in R in case you would like to try them out at home. I will warn you that my implementation of LDA is not optimized, and if you are doing analysis for any reason other than learning, I would suggest using one of the many great pieces of open source software available: Mallet, Gensim, LDA (R), topicmodels (R), and scikit-learn.
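As a rough illustration of what using one of those packages looks like, the sketch below fits LDA with collapsed Gibbs sampling via the topicmodels package in R. The AssociatedPress document-term matrix ships with the package; the number of topics, the subset of documents, the iteration counts, and the seed are arbitrary choices for demonstration, not recommendations.

# A minimal sketch, assuming the topicmodels package is installed.
library(topicmodels)

# Example document-term matrix bundled with the package.
data("AssociatedPress", package = "topicmodels")

# Fit a 5-topic model on the first 100 documents using Gibbs sampling.
# k, burnin, iter, and seed are illustrative values only.
fit <- LDA(AssociatedPress[1:100, ],
           k = 5,
           method = "Gibbs",
           control = list(seed = 42, burnin = 500, iter = 1000))

terms(fit, 10)     # top 10 terms for each topic
topics(fit)[1:10]  # most probable topic for the first 10 documents

Packages like this are heavily tested and far faster than the teaching implementation in this book, which is exactly why I recommend them for anything beyond learning.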