python scikit-learn nlp tf-idf tfidfvectorizer

Best way to retrieve top tokens in TF-IDF models


How can I get an overview of the most important tokens from a scikit-learn pipeline with the following components?

multinb = Pipeline([('vect', CountVectorizer()),
           ('tfidf', TfidfTransformer()),
           ('clf', MultinomialNB()),
          ])

multinb.fit(X_train, y_train)

I'm looking for a simple snippet that visualizes/plots the top-weighted tokens over all of X.


Solution

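  • How about extracting the per-class token log probabilities of MultinomialNB? Older scikit-learn releases exposed them as coef_; current releases expose them as feature_log_prob_:

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    
    multinb = Pipeline([('vect', CountVectorizer()),
                        ('tfidf', TfidfTransformer()),
                        ('clf', MultinomialNB()),
                       ])
    
    multinb.fit(X_train, y_train)
    
    # feature_log_prob_ has one row per class; row 0 is the first class in clf.classes_.
    # (coef_ was removed from MultinomialNB in scikit-learn 1.1.)
    token_imp = pd.DataFrame(
        data=multinb['clf'].feature_log_prob_[0],
        # get_feature_names() was removed in scikit-learn 1.2
        index=multinb['vect'].get_feature_names_out(),
        columns=['log_prob']
    ).sort_values(by='log_prob', ascending=False)
    
    print(token_imp)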
    

    This ranks the tokens in descending order of their weight for the first class, which works as a rough feature importance. Since token_imp is a DataFrame, you can view the n highest-ranked tokens with token_imp.head(n) and plot them with token_imp.head(n).plot.bar(). Plotting the whole frame would draw one bar per vocabulary token.