Tags: python, nlp, gensim, lda, topic-modeling

LDA: gensim topic model gives the same set of words for different topics


Why am I getting the same set of words for different topics in a gensim LDA model? I used the parameters below, and I checked that there are no duplicate documents in my corpus.

import gensim

# MY_CORPUS, WORD_AND_ID and minimum_probability are defined elsewhere.
lda_model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
                                           id2word=WORD_AND_ID,
                                           num_topics=4,
                                           minimum_probability=minimum_probability,
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',  # alternatives: 'symmetric', 'asymmetric'
                                           per_word_topics=True)

Results

[
(0, '0.004*lily + 0.01*rose + 0.00*jasmine'),
(1, '0.005*geometry + 0.07*algebra + 0.01*calculation'),
(2, '0.003*painting + 0.001*brush + 0.01*colors'),
(3, '0.005*geometry + 0.07*algebra + 0.01*calculation')
]

Note that topics #1 and #3 are identical.


Solution

  • Each topic is a probability distribution over the whole vocabulary, so it likely contains a large number of words with different weights. When a topic is displayed (e.g. with lda_model.show_topics()), you only see the few words with the largest weights. Two topics with the same top words can still differ in the weights of the remaining vocabulary.

    You can control the number of displayed words to inspect the remaining weights:

     lda_model.show_topics(num_topics=4, num_words=10, log=False, formatted=True)

    Increase the num_words parameter to include even more words.
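To check whether two topics really are the same, it can help to compare their full word distributions rather than only the displayed top words. The following is a minimal sketch of that idea using only the standard library. The two toy topics and their probabilities are made up for illustration, and the helper names top_words and hellinger are hypothetical, not part of gensim.

```python
import math


def top_words(topic, n):
    """Return the n highest-weighted words of a topic given as {word: probability}."""
    ranked = sorted(topic.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def hellinger(p, q):
    """Hellinger distance between two word distributions (0.0 means identical)."""
    vocab = set(p) | set(q)
    total = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in vocab
    )
    return math.sqrt(total / 2.0)


# Toy topics (made-up numbers): same three leading words, different tails.
topic_a = {"algebra": 0.40, "calculation": 0.25, "geometry": 0.20,
           "proof": 0.10, "matrix": 0.05}
topic_b = {"algebra": 0.40, "calculation": 0.25, "geometry": 0.20,
           "proof": 0.05, "matrix": 0.10}

print(top_words(topic_a, 3) == top_words(topic_b, 3))  # same top-3 words
print(top_words(topic_a, 5) == top_words(topic_b, 5))  # tails differ
print(hellinger(topic_a, topic_b))                     # non-zero distance
```

With a trained model, a dictionary like topic_a could presumably be built from dict(lda_model.show_topic(topic_id, topn=len(WORD_AND_ID))), assuming WORD_AND_ID is the gensim dictionary passed as id2word. A distance of exactly zero between two topics would suggest they are true duplicates rather than topics that merely share their leading words.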

    It is also possible that:

    • the number of topics should be different (e.g. 3),
    • minimum_probability should be smaller (what value do you use?),
    • the number of passes should be larger,
    • chunksize should be smaller,
    • the corpus should be larger (what is its size?) or stripped of stop words (did you do that?).

    I encourage you to experiment with different values of these parameters to check whether any combination works better.
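One way to organize such an experiment is to enumerate the parameter combinations up front and train one model per combination. The sketch below only builds the combinations; the candidate values are illustrative guesses rather than recommendations, parameter_grid is a hypothetical helper, and the training step is left as a comment because it depends on your corpus.

```python
from itertools import product


def parameter_grid(options):
    """Yield one dict per combination of the values listed in `options`."""
    names = list(options)
    for values in product(*(options[name] for name in names)):
        yield dict(zip(names, values))


# Candidate values are illustrative guesses, not recommendations.
options = {
    "num_topics": [3, 4],
    "passes": [10, 50],
    "chunksize": [50, 100],
}

combinations = list(parameter_grid(options))
for params in combinations:
    print(params)
    # Hypothetical next step for each combination (assumes MY_CORPUS and
    # WORD_AND_ID from the question are available):
    # model = gensim.models.ldamodel.LdaModel(corpus=MY_CORPUS,
    #                                         id2word=WORD_AND_ID,
    #                                         random_state=100, **params)
    # print(model.show_topics(num_topics=params["num_topics"], num_words=10))
```

Keeping random_state fixed across runs, as in the question, makes it easier to attribute a change in the topics to the parameter that was varied.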