apache-spark machine-learning apache-spark-mllib lda apache-spark-ml

Online learning of LDA model in Spark

Is there a way to train an LDA model in an online-learning fashion, i.e. loading a previously trained model and updating it with new documents?

Solution

Answering my own question: it is not possible as of now.

Actually, Spark has two implementations of LDA model training, and one of them is OnlineLDAOptimizer. This approach is specifically designed to update the model incrementally with mini-batches of documents.

The optimizer implements the online variational Bayes LDA algorithm, which processes a subset of the corpus on each iteration and updates the term-topic distribution adaptively.
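To make the "adaptive update" concrete, here is a minimal sketch of the blending step from the Hoffman et al. paper, in plain Python rather than Spark code. The function names `learning_rate` and `blend` are hypothetical. `tau0` and `kappa` follow the paper's notation, and the default values are assumed to match those of OnlineLDAOptimizer. Computing the mini-batch estimate `lam_hat` (the variational E-step) is left out and taken as an input.

```python
def learning_rate(t, tau0=1024.0, kappa=0.51):
    """Step size rho_t = (tau0 + t) ** -kappa, which shrinks as iteration t grows."""
    return (tau0 + t) ** -kappa


def blend(lam, lam_hat, rho):
    """Move each topic-term weight a fraction rho toward the mini-batch estimate.

    lam     -- current topic-term weights, one row per topic
    lam_hat -- estimate computed from the current mini-batch only (assumed given)
    """
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(row, row_hat)]
        for row, row_hat in zip(lam, lam_hat)
    ]


# One topic, two terms: with rho = 0.5 the result is the midpoint.
halfway = blend([[1.0, 3.0]], [[3.0, 1.0]], 0.5)
```

Because the step size decays, early mini-batches move the topics a lot and later ones only refine them. That is also why resuming training would need the iteration counter as well as the topic matrix.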

Original Online LDA paper: Hoffman, Blei and Bach, "Online Learning for Latent Dirichlet Allocation." NIPS, 2010.

Unfortunately, the current mllib API does not allow loading a previously trained LDA model and adding a batch to it.

Some mllib models support an initialModel as a starting point for incremental updates (see KMeans or GMM), but LDA does not currently support that. I filed a JIRA for it: SPARK-20082. Please upvote ;-)
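For readers unfamiliar with the initialModel idea, here is a toy illustration in plain Python. It is not the Spark API, and `lloyd_1d` is a hypothetical helper: a 1-D k-means that accepts starting centers, so a second call can pick up from where a previous fit ended instead of starting from scratch.

```python
def lloyd_1d(points, centers, iterations=10):
    """Run Lloyd's k-means on 1-D points, starting from the given centers."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old one if a cluster received no points.
        centers = [sum(c) / len(c) if c else old for c, old in zip(clusters, centers)]
    return centers


# First batch: fit from a rough initial guess.
model = lloyd_1d([1.0, 2.0, 9.0, 10.0], centers=[0.0, 5.0])

# More data arrives: warm-start from the previous centers.
updated = lloyd_1d([1.0, 2.0, 9.0, 10.0, 3.0, 11.0], centers=model)
```

The missing piece for LDA is the equivalent of that `centers` argument: a way to hand a previously trained model back to the optimizer.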

For the record, there's also a JIRA for streaming LDA: SPARK-8696.