
Calculate coherence for non-gensim topic model


I've built a topic model, with:

  • Input: list of tokenized lists
  • Output: an m x t matrix of words by topics (with each cell indicating the probability of word i appearing in topic k).
  • Output: a t x n matrix of topics by documents (with each cell indicating the probability of topic k in document j).

To find the optimal number of topics, I want to calculate the coherence of a model. However, I am only aware of Gensim's CoherenceModel, which seems to require a Gensim model as input.

Are there any other packages/implementations that I could use to calculate the coherence of a computed topic model? Or, if it is possible to use CoherenceModel without passing in an LdaModel, could someone show me how to do that?


Solution

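  • You can do this with the Gensim package itself: CoherenceModel accepts a list of topics in place of a model.

    input_data = list of lists of tokens, one inner list per document

    topics = list of topics, each given as a list of its top N words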

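    import gensim.corpora as corpora
    from gensim.models.coherencemodel import CoherenceModel
    
    # Build the vocabulary and the bag-of-words corpus from the tokenized texts.
    id2word = corpora.Dictionary(input_data)
    corpus = [id2word.doc2bow(text) for text in input_data]
    
    # Pass the topics directly instead of a Gensim model.
    # 'c_v' is estimated from texts; corpus is what 'u_mass' would use.
    cm = CoherenceModel(
        topics=topics,
        texts=input_data,
        corpus=corpus,
        dictionary=id2word,
        coherence='c_v')
    coherence = cm.get_coherence()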
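
The snippet above expects `topics` as lists of words, while the model in the question produces a word-by-topic probability matrix. A minimal sketch of one way to bridge the two, assuming the rows of the matrix follow the order of a vocabulary list. The names `vocab`, `word_topic` and `top_words_per_topic` are made up for illustration, and the toy numbers are not real model output:

```python
def top_words_per_topic(word_topic, vocab, top_n=10):
    """Return the top_n most probable words for each topic.

    word_topic: m x t nested sequence, word_topic[i][k] = P(word i | topic k)
    vocab:      list of m strings; vocab[i] labels row i (assumed layout)
    """
    n_topics = len(word_topic[0])
    topics = []
    for k in range(n_topics):
        # Rank word indices by their probability in topic k, highest first.
        ranked = sorted(
            range(len(vocab)),
            key=lambda i: word_topic[i][k],
            reverse=True,
        )
        topics.append([vocab[i] for i in ranked[:top_n]])
    return topics


# Toy example: 4 words, 2 topics (values invented for illustration).
vocab = ["cat", "dog", "stock", "market"]
word_topic = [
    [0.50, 0.05],
    [0.40, 0.05],
    [0.05, 0.50],
    [0.05, 0.40],
]
topics = top_words_per_topic(word_topic, vocab, top_n=2)
print(topics)
```

The resulting `topics` list can then be passed to CoherenceModel as in the answer. The words in `topics` should be tokens that also occur in `input_data`, since the Dictionary is built from those texts.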