What is the difference between using gensim.models.LdaMallet and gensim.models.LdaModel? I noticed that the parameters are not all the same, and I would like to know when one should be used over the other.
TL;DR: They are two completely independent implementations of Latent Dirichlet Allocation (LDA). Use gensim's own LdaModel if you simply want to try out LDA and are not interested in Mallet's special features.
gensim.models.LdaModel is the single-core version of LDA implemented in gensim. There is also a parallelized LDA version available in gensim (gensim.models.ldamulticore).
Both Gensim implementations use an online variational Bayes (VB) algorithm for Latent Dirichlet Allocation as described in Hoffman et al. [1].
Gensim algorithms (not limited to LDA) are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core).
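"Streamed, out-of-core" means the corpus only needs to be iterable, not held in memory. A minimal pure-Python sketch of such a corpus (file contents and vocabulary are illustrative; any re-iterable like this can be passed to LdaModel as its corpus argument):

```python
import tempfile
from collections import defaultdict

class StreamedCorpus:
    """Yield one bag-of-words vector per document (one document per line),
    so the full corpus never has to fit in RAM."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id  # precomputed vocabulary: token -> integer id

    def __iter__(self):
        with open(self.path) as f:
            for line in f:
                counts = defaultdict(int)
                for token in line.lower().split():
                    if token in self.token2id:
                        counts[self.token2id[token]] += 1
                yield sorted(counts.items())  # (token_id, count) pairs

# Illustrative two-document corpus written to a temporary file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("human computer interface\ncomputer system system\n")
    path = f.name

vocab = {"human": 0, "computer": 1, "interface": 2, "system": 3}
corpus = StreamedCorpus(path, vocab)
print(list(corpus))  # documents are read lazily, one line at a time
```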
Gensim also offers wrappers for the popular tools Mallet (Java) and Vowpal Wabbit (C++).
gensim.models.wrappers.LdaVowpalWabbit uses the same online variational Bayes (VB) algorithm that Gensim's LdaModel is based on [1].
gensim.models.wrappers.LdaMallet uses an optimized Gibbs sampling algorithm for Latent Dirichlet Allocation [2]. This difference in algorithms is the reason the parameters differ.
However, most of the parameters, e.g., the number of topics, alpha, and eta, are shared between both algorithms because both implement LDA.
Both wrappers (gensim.models.wrappers.LdaVowpalWabbit and gensim.models.wrappers.LdaMallet) require the respective external tool to be installed separately from gensim. In that respect, gensim's own LdaModel is easier to use.
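For the Mallet wrapper, a hedged sketch of what a setup might look like (the MALLET_PATH location is an assumption for illustration; the wrapper requires a separate Mallet installation and shipped with gensim before the wrappers were removed in 4.0):

```python
import os
from pathlib import Path

# Assumed location of the Mallet binary -- adjust for your installation.
MALLET_PATH = os.environ.get("MALLET_PATH", "/opt/mallet/bin/mallet")

def train_mallet_lda(corpus, id2word, num_topics=10):
    """Sketch: run Mallet's Gibbs-sampling LDA through gensim's wrapper.
    The import is deferred because the wrapper only exists in gensim < 4.0
    and only works when the Mallet tool itself is installed."""
    from gensim.models.wrappers import LdaMallet
    return LdaMallet(MALLET_PATH, corpus=corpus, id2word=id2word,
                     num_topics=num_topics)

# Only attempt training when a Mallet binary is actually present.
if Path(MALLET_PATH).exists():
    print("Mallet found; calling train_mallet_lda would shell out to it")
else:
    print("Mallet binary not found; install Mallet and set MALLET_PATH")
```

The corpus and id2word arguments take the same bag-of-words corpus and dictionary you would pass to gensim's LdaModel, which is what makes the two easy to compare.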
Besides that, try out the different implementations and see what works for you.
[1] Hoffman, Matthew, Francis R. Bach, and David M. Blei. "Online learning for latent Dirichlet allocation." Advances in Neural Information Processing Systems. 2010.
[2] Yao, Limin, David Mimno, and Andrew McCallum. "Efficient methods for topic model inference on streaming document collections." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009.