Consider the example below. The important words that distinguish the documents are 'Bob' and 'Sara', but with max_features the output tends to keep only the frequent words. This gets worse when the corpus is big. How can we keep only the important words?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'hi, my name is Bob.',
    'hi, my name is Sara.'
]

vectorizer = TfidfVectorizer(max_features=2)
X = vectorizer.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
print(df)
The output:
         hi        is
0  0.707107  0.707107
1  0.707107  0.707107
If you increase the max_features:
vectorizer = TfidfVectorizer(max_features=10)
X = vectorizer.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
print(df)
bob hi is my name sara
0 0.574962 0.40909 0.40909 0.40909 0.40909 0.000000
1 0.000000 0.40909 0.40909 0.40909 0.40909 0.574962
You can see that 'sara' and 'bob' really are the important words: their tf-idf scores are higher, while the words repeated in both sentences all get the same, smaller score, which makes sense.
Notice what the documentation says about max_features:
"If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus." So it may remove the more useful words, as in the previous case.
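You can check which terms max_features actually kept by inspecting the fitted vectorizer's vocabulary_ attribute. A minimal sketch on the same toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'hi, my name is Bob.',
    'hi, my name is Sara.'
]

# With max_features=2, only the two terms with the highest corpus-wide
# term frequency survive; 'bob' and 'sara' each occur once, so they
# lose to the words repeated in both sentences and get dropped.
vectorizer = TfidfVectorizer(max_features=2)
vectorizer.fit(corpus)
print(sorted(vectorizer.vocabulary_))
```

This makes it easy to see, before transforming anything, whether the cut-off has thrown away the words you care about.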
You may be more interested in the options max_df or min_df:
vectorizer = TfidfVectorizer(max_df=0.5)
X = vectorizer.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
print(df)
bob sara
0 1.0 0.0
1 0.0 1.0
Perhaps it is best to try different approaches until you get a sense of what is going on. From another point of view, it could also help to remove some of the stop words.