I have trained an NLP model for "Consumer Complaints Classification" using logistic regression and a TF-IDF vectorizer. I want to know which words my model associates with a particular class. I am looking for something like this:
Class 1 = ["List of words that help my model identify that an input text belongs to this class"]
I suppose what you need is the most important words (or, better, tokens) associated with one class, because usually every token is "associated" with every class one way or another. So I will answer your question with the following approach:
Let's assume your tokens (or words) generated by the TfidfVectorizer are stored in X_train, with labels in y_train, and you trained a model like:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(corpus)
clf = LogisticRegression()
clf.fit(X_train, y_train)
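The coef_ attribute of LogisticRegression has shape (n_classes, n_features) for multiclass problems and contains the coefficient calculated for each token and each class. This means that by indexing it by row you can access the coefficients used for one particular class, e.g. coef_[0] for class 0, coef_[1] for class 1, and so forth. The rows follow the order of clf.classes_, so "class 0" here means clf.classes_[0]. Note that for a binary problem coef_ has a single row, i.e. shape (1, n_features), where positive coefficients point towards clf.classes_[1].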
Just reassociate the token names with the coefficients and sort them by value. The largest positive coefficients belong to the tokens that push a prediction most strongly towards that class, so this gives you the most important tokens for each class. An example to get the most important tokens for class 0:
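import pandas as pd
# get_feature_names_out() needs scikit-learn >= 1.0;
# the older get_feature_names() was removed in 1.2
important_tokens = pd.DataFrame(
    data=clf.coef_[0],
    index=vectorizer.get_feature_names_out(),
    columns=['coefficient']
).sort_values(by='coefficient', ascending=False)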
The tokens in important_tokens are now sorted by their importance for class 0 and can easily be extracted via the index values. For example, to get the n most important features as an array: important_tokens.head(n).index.values (or important_tokens.head(n).index.tolist() if you want a plain Python list).
If you want the most important tokens for other classes, just replace the index of the coef_ attribute as needed.