Search code examples
text2vec

text2vec - Do topics' words update with new data?


I'm currently performing a topic modelling using LDA from text2vec package. I managed to create a dtm matrix and then apply LDA and its fit_transform method with n_topics=50.

While looking at the top words from each topic, a question popped into my mind. I plan to apply the model to new data afterwards and there's a possibility of occurence of new words, which were not encountered by the model before. Will the model still be able to assign each word to its respective topic? Moreover, will these words also be added to the topic, so that I will be able to locate them using get_top_words?

Thank you for answering!


Solution

  • Idea of statistical learning is that underlying distributions of "train" data and "test" data are more or less the same. So if your new documents contains totally different distribution you can't expect LDA will magically work. This is true for any other model.

    During inference time topic-word distribution is fixed (it was learned at training stage). So get_top_words will always return same words after model trained.

    And of course new words won't be included automatically - Document-Term matrix constructed from a vocabulary (which you learn before construction of DTM) and new documents will also contain only words from fixed vocabulary.