I am trying to use BERTopic to analyze the topic distribution of documents. After fitting BERTopic, I would like to calculate, for each document, the probability of every topic. How should I do it?
# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size=50,
                 nr_topics=10)
# train model
headline_topics, _ = model.fit_transform(df1.review_processed3)
# examine one of the topics
freq = model.get_topic_info()  # assumed source of `freq`, which is not defined in the snippet
a_topic = freq.iloc[0]["Topic"]  # Select the 1st topic
model.get_topic(a_topic)  # Show the words and their c-TF-IDF scores
Below are the words and their c-TF-IDF scores for one of the topics (image 1).
How can I turn this result into a topic distribution like the one below (image 2), so that I can calculate the topic distribution score for each document and also identify its main topic?
First, to compute probabilities, you have to add calculate_probabilities=True to your model definition (this could slow down the extraction of topics if you have many documents, > 100000).
from bertopic import BERTopic

# define model
model = BERTopic(verbose=True,
                 vectorizer_model=vectorizer_model,
                 embedding_model='paraphrase-MiniLM-L3-v2',
                 min_topic_size=50,
                 nr_topics=10,
                 calculate_probabilities=True)
Then, when calling fit_transform, you should save the probabilities instead of discarding them:
headline_topics, probs = model.fit_transform(df1.review_processed3)
Now, you can create a pandas dataframe which shows probabilities under respective topics per document.
import pandas as pd
# one row per document, one column per topic
probs_df = pd.DataFrame(probs)
# highest topic probability for each document
probs_df['main percentage'] = probs_df.max(axis=1)
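To also identify the main topic per document, one option is to take the column with the highest probability in each row. The sketch below is only an illustration: the small `probs` array is made-up data standing in for the array returned by fit_transform, and the `topic_0`, `topic_1`, ... column names are hypothetical labels, assuming that column i of `probs` corresponds to topic i.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the `probs` array returned by fit_transform:
# one row per document, one column per topic.
probs = np.array([
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.25, 0.25, 0.50],
])

# Hypothetical column labels, assuming column i corresponds to topic i.
topic_cols = [f"topic_{i}" for i in range(probs.shape[1])]
probs_df = pd.DataFrame(probs, columns=topic_cols)

# Compute both summaries from the topic columns only, so that the
# newly added summary columns are never included in max/idxmax.
probs_df["main topic"] = probs_df[topic_cols].idxmax(axis=1)
probs_df["main percentage"] = probs_df[topic_cols].max(axis=1)

print(probs_df)
```

Restricting the computation to `topic_cols` matters once summary columns have been added to the same dataframe; otherwise a later `max` or `idxmax` over all columns would mix them in with the topic probabilities.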