python nlp bert-language-model topic-modeling

How to calculate per-document probabilities under respective topics with BERTopic?


I am trying to use BERTopic to analyze the topic distribution of documents. After fitting BERTopic, I would like to calculate each document's probability under each topic. How should I do it?

from bertopic import BERTopic

# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size= 50,
                 nr_topics=10)

#  train model
headline_topics, _ = model.fit_transform(df1.review_processed3)

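# examine one of the topics
freq = model.get_topic_info()  # assumed source of `freq`: the per-topic info table
a_topic = freq.iloc[0]["Topic"]  # Select the topic in the first row (may be the outlier topic -1)
model.get_topic(a_topic)  # Show the words and their c-TF-IDF scores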

Below are the words and their c-TF-IDF scores for one of the topics (image 1).

How should I convert this result into a topic distribution like the one below (image 2), so that I can calculate the topic distribution score and identify the main topic?


Solution

  • First, to compute probabilities, add calculate_probabilities=True to your model definition (this can slow down topic extraction if you have many documents, e.g. more than 100,000).

    # define model
    model = BERTopic(verbose=True,
                     vectorizer_model=vectorizer_model,
                     embedding_model='paraphrase-MiniLM-L3-v2',
                     min_topic_size= 50,
                     nr_topics=10,
                     calculate_probabilities=True)
    

    Then, when calling fit_transform, save the probabilities it returns:

    headline_topics, probs = model.fit_transform(df1.review_processed3)
    

    Now you can create a pandas DataFrame that shows each document's probability under each topic.

    import pandas as pd
    probs_df = pd.DataFrame(probs)  # one row per document, one column per topic
    # Highest topic probability in each row
    probs_df['main percentage'] = probs_df.max(axis=1)
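
The question also asks how to identify the main topic of each document. A minimal sketch of one way to do that is below. It uses a small made-up `probs` array in place of the one returned by fit_transform, and the `topic_i` column names are hypothetical labels added for readability. It assumes that column i corresponds to topic i and that the outlier topic -1 has no column, which is worth checking against model.get_topic_info() on your own model.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `probs` array returned by fit_transform:
# 3 documents (rows) x 4 topics (columns).
probs = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.70, 0.10, 0.10, 0.10],
    [0.05, 0.15, 0.30, 0.50],
])

# Illustrative column labels; assumes column i maps to topic i.
topic_cols = [f"topic_{i}" for i in range(probs.shape[1])]
probs_df = pd.DataFrame(probs, columns=topic_cols)

# Compute both summaries from the topic columns only, so the newly added
# columns never take part in the max/idxmax calculation.
probs_df["main_topic"] = probs_df[topic_cols].idxmax(axis=1)
probs_df["main_probability"] = probs_df[topic_cols].max(axis=1)

print(probs_df)
```

Restricting the calculation to `topic_cols` matters: once a summary column has been added to the DataFrame, calling max or idxmax on the whole frame would include it alongside the topic probabilities.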