python-3.x pandas nlp bert-language-model topic-modeling

Cast topic modeling outcome to dataframe


I have used BERTopic with KeyBERT to extract some topics from some docs:

from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size=3)
topics, probs = topic_model.fit_transform(docs)

Now I can access the topic names:

freq = topic_model.get_topic_info()
print("Number of topics: {}".format(len(freq)))
freq.head(30)

   Topic    Count   Name
0   -1       1     -1_default_greenbone_gmp_manager
1    0      14      0_http_tls_ssl tls_ssl
2    1      8       1_jboss_console_web_application

and inspect the words in each topic (here topics 0 and 1):

[('http', 0.0855701486234524),          
 ('tls', 0.061977919455444744),
 ('ssl tls', 0.061977919455444744),
 ('ssl', 0.061977919455444744),
 ('tcp', 0.04551718585531556),
 ('number', 0.04551718585531556)]

[('jboss', 0.14014705432060262),
 ('console', 0.09285308122803233),
 ('web', 0.07323749337563096),
 ('application', 0.0622930523123512),
 ('management', 0.0622930523123512),
 ('apache', 0.05032395169459188)]

What I want is a final dataframe that has the topic name in one column and the elements of the topic in another.

Expected outcome:

  class                         entities
0 http_tls_ssl tls_ssl           HTTP...etc
1 jboss_console_web_application  JBoss, console, etc

and a second dataframe with each topic name as a separate column:

  http_tls_ssl tls_ssl           jboss_console_web_application
0 http                           JBoss
1 tls                            console
2 etc                            etc

I could not find out how to do this. Is there a way?


Solution

  • Here is one way to do it:

    Setup

    import pandas as pd
    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups
    
    docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
    
    topic_model = BERTopic()
    # To keep the example reproducible in a reasonable time, limit to 3,000 docs
    topics, probs = topic_model.fit_transform(docs[:3_000])
    
    df = topic_model.get_topic_info()
    print(df)
    # Output
       Topic  Count                    Name
    0     -1     23         -1_the_of_in_to
    1      0   2635         0_the_to_of_and
    2      1    114          1_the_he_to_in
    3      2    103         2_the_to_in_and
    4      3     59           3_ditto_was__
    5      4     34  4_pool_andy_table_tell
    6      5     32       5_the_to_game_and
    

    First dataframe

    Using Pandas string methods:

    df = (
        df[["Name"]]  # keep only the topic name column
        .rename(columns={"Name": "class"})
        .reset_index(drop=True)
    )
    
    # One list of words per topic; empty strings become missing values
    df["entities"] = [
        [word if word else pd.NA for word, _ in words]
        for words in topic_model.get_topics().values()
    ]
    
    # Remove the -1 (outlier) topic; copy to avoid SettingWithCopyWarning below
    df = df.loc[~df["class"].str.startswith("-1"), :].copy()
    
    df["class"] = df["class"].replace(
        r"^-?\d+_", "", regex=True
    )  # remove prefix '1_', '2_', ...
    
    print(df)
    # Output
                      class                                                      entities
    1         the_to_of_and                [the, to, of, and, is, in, that, it, for, you]
    2          the_he_to_in               [the, he, to, in, and, that, is, of, his, year]
    3         the_to_in_and             [the, to, in, and, of, he, team, that, was, game]
    4           ditto_was__  [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
    5  pool_andy_table_tell  [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
    6       the_to_game_and           [the, to, game, and, games, espn, on, in, is, have]
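
The question's expected outcome shows the entities as plain text ("JBoss, console, etc") rather than as lists. A minimal sketch of that variant follows, assuming a dictionary shaped like the return value of `topic_model.get_topics()` (topic id mapped to a list of `(word, score)` pairs). The `topic_words` and `topic_names` values are hand-written stand-ins loosely based on the question's output, with illustrative scores, and `build_class_frame` is a hypothetical helper name, not part of BERTopic. Looking the words up by topic id also avoids relying on both results being in the same order.

```python
import pandas as pd

# Stand-in for topic_model.get_topics(): {topic_id: [(word, score), ...]}
# Words follow the question's output; the scores are illustrative only.
topic_words = {
    -1: [("default", 0.05), ("greenbone", 0.04)],
    0: [("http", 0.0856), ("tls", 0.0620), ("ssl tls", 0.0620), ("", 1e-05)],
    1: [("jboss", 0.1401), ("console", 0.0929), ("web", 0.0732)],
}

# Stand-in for the Topic and Name columns of topic_model.get_topic_info()
topic_names = pd.DataFrame(
    {
        "Topic": [-1, 0, 1],
        "Name": [
            "-1_default_greenbone_gmp_manager",
            "0_http_tls_ssl tls_ssl",
            "1_jboss_console_web_application",
        ],
    }
)


def build_class_frame(names: pd.DataFrame, words_by_topic: dict) -> pd.DataFrame:
    """Return one row per topic: cleaned name plus comma-joined words."""
    kept = names.loc[names["Topic"] != -1].copy()  # skip the -1 topic
    out = pd.DataFrame(
        {
            # Strip the numeric prefix such as '0_' or '1_'
            "class": kept["Name"].str.replace(r"^-?\d+_", "", regex=True),
            # Look words up by topic id and drop empty padding strings
            "entities": kept["Topic"].map(
                lambda t: ", ".join(w for w, _ in words_by_topic[t] if w)
            ),
        }
    )
    return out.reset_index(drop=True)


class_df = build_class_frame(topic_names, topic_words)
print(class_df)
```

With a fitted model, the two stand-ins would presumably be replaced by `topic_model.get_topics()` and `topic_model.get_topic_info()`.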
    

    Second dataframe

    Using Pandas transpose:

    other_df = df.T.reset_index(drop=True)  # row 0: class names, row 1: word lists
    new_col_labels = other_df.iloc[0]  # save first row (the class names)
    other_df = other_df[1:]  # remove first row
    other_df.columns = new_col_labels
    # Expand each word list into a column of its own
    other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})
    
    print(other_df)
    # Output
      the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
    0           the          the           the       ditto                 pool             the
    1            to           he            to         was                 andy              to
    2            of           to            in        <NA>                table            game
    3           and           in           and        <NA>                 tell             and
    4            is          and            of        <NA>                   us           games
    5            in         that            he        <NA>                 well            espn
    6          that           is          team        <NA>                 your              on
    7            it           of          that        <NA>                about              in
    8           for          his           was        <NA>                 <NA>              is
    9           you         year          game        <NA>                 <NA>            have
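
For comparison, the wide layout can also be built without a transpose, by passing a dictionary of `pd.Series` to the `DataFrame` constructor. This is only a sketch with hand-written stand-in data (the `entities_by_class` dictionary is hypothetical). pandas aligns the Series on their index and fills the shorter columns with `NaN`, so topics with different numbers of words still fit in one frame.

```python
import pandas as pd

# Hypothetical stand-in: cleaned topic name -> list of topic words
entities_by_class = {
    "http_tls_ssl tls_ssl": ["http", "tls", "ssl tls", "ssl"],
    "jboss_console_web_application": ["jboss", "console", "web"],
}

# Each Series becomes one column; shorter ones are padded with NaN
wide_df = pd.DataFrame(
    {
        name: pd.Series(words, dtype="object")
        for name, words in entities_by_class.items()
    }
)
print(wide_df)
```

Starting from the `df` built in the first step, a dictionary of this shape could be obtained with `dict(zip(df["class"], df["entities"]))`, provided the class names are unique.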