I have trained a BERTopic model on a dataframe of length of 400k. I want to map the topics of each document in a new column inside the dataframe. I could do that by running a for loop on all the documents and do topic_model.transform(doc)
on them. The only problem is, it takes more than a second to transform each document into its topic and it would take days for the whole dataset.
Is there a way to achieve this faster since I want to map the topics on the training data.
I tried:
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
topics = []
for text in df.texts:
tops = topic_model.transform(text)
topics.append(tops)
df['topics'] = topics
There is no need to recalculate the topics as you already retrieved them when using .fit_transform
. There, the topics
that you retrieve are in the exact same order as the input documents. Therefore, you can perform the following:
# The `topics` that you get here are in the exact same order as `docs`
# `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)
# When you used `.fit_transform`:
df = pd.DataFrame({"Document": docs, "Topic": topic})
For those using .fit
instead of .fit_transform
, you can also access the topics and their documents as follows:
# When you used `.fit`:
df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})