How can I get an overview of the most important tokens from a scikit-learn pipeline with the following components:
multinb = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB()),
                    ])
multinb.fit(X_train, y_train)
I'm looking for a simple snippet that visualizes/plots the top-weighted tokens overall.
How about extracting the coef_ of MultinomialNB (in scikit-learn 1.1 and later, where coef_ was removed, use feature_log_prob_ instead):
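import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

multinb = Pipeline([('vect', CountVectorizer()),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultinomialNB()),
                    ])
multinb.fit(X_train, y_train)

# coef_ was removed from MultinomialNB in scikit-learn 1.1; feature_log_prob_
# holds the per-class log P(token | class). Row 0 is the first class in classes_.
# get_feature_names() became get_feature_names_out() in scikit-learn 1.0.
token_imp = pd.DataFrame(
    data=multinb['clf'].feature_log_prob_[0],
    index=multinb['vect'].get_feature_names_out(),
    columns=['coefficient']
).sort_values(by='coefficient', ascending=False)
print(token_imp)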
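This will give you something like feature importances in descending order. Note that the values are log probabilities of each token for a single class (the selected row), so the ranking is per class rather than across all classes. Since token_imp is a DataFrame, you can view the n most important features with token_imp.head(n) and visualize them with token_imp.head(n).plot.bar().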