Tags: gensim, lda, topic-modeling, mallet

(gensim) LdaMallet vs LdaModel?


What is the difference between using gensim.models.LdaMallet and gensim.models.LdaModel? I noticed that the parameters are not all the same, and I would like to know when one should be used over the other.


Solution

  • TL;DR: The two are completely independent implementations of Latent Dirichlet Allocation. Use gensim's own LdaModel if you simply want to try out LDA and are not interested in Mallet's special features.

    gensim.models.LdaModel is the single-core version of LDA implemented in gensim. A parallelized version is also available in gensim (gensim.models.ldamulticore). Both gensim implementations use the online variational Bayes (VB) algorithm for Latent Dirichlet Allocation described in Hoffman et al. [1].

    Gensim's algorithms (not limited to LDA) are memory-independent w.r.t. corpus size: the input is streamed and processed out-of-core, so it can be larger than RAM.
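
    To illustrate what "streamed" means here, the following is a rough sketch of a corpus object that reads one document per line from disk and yields it as a bag-of-words vector, so that only a single document is held in memory at a time. The class name StreamedCorpus, the hand-written token2id mapping, and the temporary file are assumptions made for this example; in practice the mapping would typically come from a gensim Dictionary. The sketch uses only the standard library.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words document at a time from a text file."""

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # The file is re-opened on every iteration, so the corpus can be
        # traversed several times (needed for multiple training passes).
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id
                )
                # Sparse vector: sorted list of (token_id, count) pairs.
                yield sorted(counts.items())


# Hypothetical vocabulary and a tiny on-disk corpus, one document per line.
token2id = {"graph": 0, "trees": 1, "system": 2}
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("graph trees graph\n")
    tmp.write("system user system trees\n")
    path = tmp.name

corpus = StreamedCorpus(path, token2id)
first_pass = list(corpus)
second_pass = list(corpus)  # iterating again re-reads the file
os.remove(path)

print(first_pass)
```

    Gensim models accept any iterable of such sparse vectors as their corpus, so an object along these lines could be passed in place of an in-memory list.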

    Gensim also offers wrappers for the popular tools Mallet (Java) and Vowpal Wabbit (C++).

    gensim.models.wrappers.LdaVowpalWabbit uses the same online variational Bayes (VB) algorithm that Gensim’s LdaModel is based on [1].

    gensim.models.wrappers.LdaMallet uses an optimized Gibbs sampling algorithm for Latent Dirichlet Allocation [2]. This is why its parameters differ. However, most parameters, e.g., the number of topics, alpha, and (b)eta, are shared between the two algorithms because both implement LDA.

    Both wrappers (gensim.models.wrappers.LdaVowpalWabbit and gensim.models.wrappers.LdaMallet) require the respective tool to be installed separately (independent of gensim). gensim's native implementations are therefore easier to set up.
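
    For comparison, here is a hedged sketch of what the Mallet route involves. It assumes Gensim 3.x (the gensim.models.wrappers package is not available in Gensim 4.0 and later) and a locally installed Mallet. The helper name train_mallet_lda and the MALLET_BIN environment variable are invented for this example; optimize_interval is a Mallet-specific setting, whereas LdaModel options such as passes and chunksize have no counterpart in the wrapper.

```python
import os


def train_mallet_lda(corpus, id2word, num_topics=10, mallet_path=None):
    """Train LDA through the Mallet wrapper (sketch for Gensim 3.x only)."""
    # MALLET_BIN is a hypothetical variable pointing at the Mallet launcher.
    mallet_path = mallet_path or os.environ.get("MALLET_BIN", "")
    if not os.path.isfile(mallet_path):
        raise FileNotFoundError(
            "Mallet executable not found at %r; install Mallet separately" % mallet_path
        )

    try:
        # Imported lazily because the wrappers package only exists in Gensim 3.x.
        from gensim.models.wrappers import LdaMallet
    except ImportError as exc:
        raise ImportError(
            "gensim.models.wrappers is unavailable in this gensim version"
        ) from exc

    return LdaMallet(
        mallet_path,
        corpus=corpus,
        num_topics=num_topics,
        id2word=id2word,
        iterations=1000,       # Gibbs sampling iterations
        optimize_interval=10,  # re-estimate hyperparameters every 10 iterations
    )
```

    The extra path check is the practical difference from the native models: the wrapper shells out to the external Java tool, so it fails unless that tool is present.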

    Besides that, try out the different implementations and see what works for you.

    References

    [1] Hoffman, Matthew, Francis R. Bach, and David M. Blei. "Online learning for latent Dirichlet allocation." Advances in Neural Information Processing Systems. 2010.

    [2] Yao, Limin, David Mimno, and Andrew McCallum. "Efficient methods for topic model inference on streaming document collections." Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2009.