python-3.x, nlp, bert-language-model, topic-modeling

Map BERTopic topic IDs back to the training dataframe


I have trained a BERTopic model on a dataframe of 400k documents. I want to store each document's topic in a new column of that dataframe. I could loop over all the documents and call topic_model.transform(doc) on each one, but every call takes more than a second, so the whole dataset would take days.

Is there a faster way to do this, given that I only want to map topics onto the training data?

I tried:

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
topic_model.reduce_topics(docs, nr_topics=200)

topics = []
for text in df.texts:
    tops = topic_model.transform(text)
    topics.append(tops)
df['topics'] = topics

Solution

  • There is no need to recalculate the topics: .fit_transform already returned them, in exactly the same order as the input documents. You can therefore build the dataframe directly:

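    import pandas as pd
    from bertopic import BERTopic

    # The `topics` that you get here are in the exact same order as `docs`:
    # `topics[0]` belongs to `docs[0]`, `topics[1]` to `docs[1]`, etc.
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)
    topic_model.reduce_topics(docs, nr_topics=200)
    
    # When you used `.fit_transform`:
    # (`topics` holds the assignments from before `reduce_topics`; in recent
    # BERTopic versions the reduced assignments are in `topic_model.topics_`)
    df = pd.DataFrame({"Document": docs, "Topic": topics})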
    

    If you used .fit instead of .fit_transform, the per-document topics are stored on the fitted model and can be read from topic_model.topics_:

    # When you used `.fit`:
    df = pd.DataFrame({"Document": docs, "Topic": topic_model.topics_})
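
For documents that were not part of the training data, the topics do have to be computed, but the per-document loop from the question can still be avoided: .transform also accepts a list of strings, so it can be called once per batch instead of once per document. Below is a minimal sketch of that idea. The helper name transform_in_batches and the batch_size default are made up for illustration and are not part of BERTopic; the only thing assumed about topic_model is that its .transform takes a list of documents and returns (topics, probs).

```python
from typing import Any, List, Sequence


def transform_in_batches(
    topic_model: Any, docs: Sequence[str], batch_size: int = 10_000
) -> List[int]:
    """Assign topics to `docs` with one `.transform` call per batch.

    `topic_model` is assumed to be a fitted BERTopic model (or any object
    whose `.transform(list_of_str)` returns `(topics, probs)`).
    `transform_in_batches` is a hypothetical helper, not a BERTopic API.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    all_topics: List[int] = []
    for start in range(0, len(docs), batch_size):
        # One call for the whole slice instead of one call per document
        batch = list(docs[start:start + batch_size])
        topics, _probs = topic_model.transform(batch)
        # Convert to plain ints so the result is a simple Python list
        all_topics.extend(int(t) for t in topics)
    return all_topics
```

The returned list is in the same order as docs, so with a hypothetical dataframe new_df of unseen texts it could be assigned directly, e.g. new_df["Topic"] = transform_in_batches(topic_model, new_df["texts"].tolist()). Batching mainly keeps memory bounded on large inputs; for a modest number of documents a single topic_model.transform(docs) call on the full list should do the same job.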