Consider the example below. The important words that distinguish the documents are 'Bob' and 'Sara', but with max_features the output tends to keep only the frequent words. This gets worse when the corpus is big. How can we keep only the important words?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'hi, my name is Bob.',
    'hi, my name is Sara.'
]

vectorizer = TfidfVectorizer(max_features=2)
X = vectorizer.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
print(df)
The output:
         hi        is
0  0.707107  0.707107
1  0.707107  0.707107
If you increase the max_features:
vectorizer = TfidfVectorizer(max_features=10)
X = vectorizer.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
print(df)
bob hi is my name sara
0 0.574962 0.40909 0.40909 0.40909 0.40909 0.000000
1 0.000000 0.40909 0.40909 0.40909 0.40909 0.574962
You can see that 'sara' and 'bob' really are the important words: their tf-idf scores are higher, while the words repeated in both sentences all get the same, smaller score, which makes sense.
Notice what the documentation says about max_features:
"If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus." So it may remove the more useful words, as in the previous case.
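You can check which terms max_features actually kept by inspecting the fitted vectorizer's vocabulary_ attribute. A minimal sketch on the same toy corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'hi, my name is Bob.',
    'hi, my name is Sara.'
]

# With max_features=2, only the two terms with the highest corpus-wide
# term frequency survive; 'bob' and 'sara' each occur once, so they
# lose to the words repeated in both sentences and get dropped.
vectorizer = TfidfVectorizer(max_features=2)
vectorizer.fit(corpus)
print(sorted(vectorizer.vocabulary_))
```

This makes it easy to see, before transforming anything, whether the cut-off has thrown away the words you care about.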
You may be more interested in the options max_df or min_df:
vectorizer = TfidfVectorizer(max_df=0.5)
X = vectorizer.fit_transform(corpus).toarray()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())
print(df)
bob sara
0 1.0 0.0
1 0.0 1.0
Perhaps it is best to try different approaches until you get a sense of what is going on. From another point of view, it could also help to remove some of the stop words.