python nlp topic-modeling

Inspect all probabilities of BERTopic model


Say I build a BERTopic model using

from bertopic import BERTopic
topic_model = BERTopic(n_gram_range=(1, 1), nr_topics=20)
topics, probs = topic_model.fit_transform(docs)

Inspecting probs gives me just a single value for each item in docs.

probs
array([0.51914467, 0.        , 0.        , ..., 1.        , 1.        ,
       1.        ])

I would like the entire probability vector across all topics (so in this case, where nr_topics=20, I want a vector of 20 probabilities for each item in docs). In other words, if I have N items in docs and K topics, I would like an NxK output.


Solution

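  • To get the probability of every topic for each document, rather than only the probability of the assigned topic, add one more argument, calculate_probabilities=True, when creating the model: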

    topic_model = BERTopic(n_gram_range=(1, 1), nr_topics=20, calculate_probabilities=True)
    

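    Note: calculate_probabilities=True only works when HDBSCAN is the clustering model, which is BERTopic's default. The values are an HDBSCAN-based approximation of the topic probabilities, not an exact representation. The default embedding model, all-MiniLM-L6-v2, is a separate component and does not affect this.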

    The official documentation describes this parameter and the same caveat: https://maartengr.github.io/BERTopic/api/bertopic.html
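Putting it together, here is a rough sketch of how you might fit the model and read the full matrix. It assumes docs is your list of strings; fit_with_full_probabilities and top_topics are helper names made up for this example, not part of BERTopic. It also assumes that column i of probs corresponds to topic i and that the outlier topic (-1) has no column, so the number of columns may not be exactly the nr_topics you asked for.

```python
def fit_with_full_probabilities(docs, nr_topics=20):
    """Fit BERTopic and return the per-document topic probability matrix.

    Hypothetical helper; bertopic is imported lazily so the pure-Python
    helper below can be used without it installed.
    """
    from bertopic import BERTopic

    topic_model = BERTopic(
        n_gram_range=(1, 1),
        nr_topics=nr_topics,
        calculate_probabilities=True,  # request one probability per topic
    )
    # With calculate_probabilities=True, probs is a 2-D array with one row
    # per document and one column per topic, instead of a 1-D array.
    topics, probs = topic_model.fit_transform(docs)
    return topic_model, topics, probs


def top_topics(prob_row, k=3):
    """Return the k (column_index, probability) pairs with the highest probability.

    Works on any sequence of numbers, e.g. one row of probs.
    """
    ranked = sorted(enumerate(prob_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Usage would look something like `topic_model, topics, probs = fit_with_full_probabilities(docs)`, after which `probs.shape` gives the N x K dimensions and `top_topics(probs[0])` lists the most likely topics for the first document.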