python scikit-learn tf-idf tfidfvectorizer

understanding top n tfidf features in TfidfVectorizer

I am trying to understand scikit-learn's TfidfVectorizer a bit better. The following code uses two documents: doc1 = "The car is driven on the road" and doc2 = "The truck is driven on the highway". Calling fit_transform generates a matrix of tf-idf weights.

According to the tf-idf matrix, shouldn't the top words be highway, truck and car rather than highway, truck and driven, since highway = truck = car = 0.63 and driven = 0.44?

#testing tfidfvectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(tokenizer=lambda x: x.split(), stop_words='english')
response = vectorizer.fit_transform(tn)

feature_array = np.array(vectorizer.get_feature_names_out()) #list of features (get_feature_names() was removed in scikit-learn 1.2)
print(feature_array)
print(response.toarray())

sorted_features = np.argsort(response.toarray()).flatten()[:-1] #index of highest valued features
print(sorted_features)

#printing top 3 weighted features
n = 3
top_n = feature_array[sorted_features][:n]
print(top_n)

['car' 'driven' 'highway' 'road' 'truck']
[[0.6316672  0.44943642 0.         0.6316672  0.        ]
 [0.         0.44943642 0.6316672  0.         0.6316672 ]]
[2 4 1 0 3 0 3 1 2]
['highway' 'truck' 'driven']

Solution

As you can see from the result, the tf-idf matrix does give the higher score to highway, truck, car (and road):

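import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

tn = ['The car is driven on the road', 'The truck is driven on the highway']
vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(tn)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = list(vectorizer.get_feature_names_out())

# one row per document, one column per term
pd.DataFrame(response.toarray(), columns=terms)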

        car    driven   highway      road     truck
0  0.631667  0.449436  0.000000  0.631667  0.000000
1  0.000000  0.449436  0.631667  0.000000  0.631667
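
The values in the table can also be reproduced by hand, which shows why car, road, highway and truck all tie at 0.63 while driven, which appears in both documents, ends up lower. This is only a sketch: it assumes the vectorizer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), and the count matrix is typed in manually rather than taken from the vectorizer.

```python
import numpy as np

# Raw term counts after English stop-word removal (entered by hand for illustration).
# Columns: car, driven, highway, road, truck
counts = np.array([[1, 1, 0, 1, 0],
                   [0, 1, 1, 0, 1]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1    # smoothed idf, assuming smooth_idf=True
weighted = counts * idf                      # tf * idf before normalisation

# Assuming norm='l2': scale every row to unit Euclidean length
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)
print(tfidf.round(6))
```

Because driven occurs in both documents, its idf is the minimum value of 1, so after normalisation it is the lowest non-zero weight in each row.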

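What's wrong is the further check you do by flattening the array. np.argsort sorts each row separately and in ascending order, so after flatten() the first three indices are the three lowest-scoring terms of the first document (highway, truck, driven), not the highest-scoring terms overall. To get the top scores across all rows, you could instead do something like: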

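# best score of each term over all documents, then term indices in ascending score order
max_scores = response.toarray().max(0).argsort()
# the last four indices belong to the four highest-scoring terms
np.array(terms)[max_scores[-4:]]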
array(['car', 'highway', 'road', 'truck'], dtype='<U7')

These are the four feature names that score 0.63 in the DataFrame.
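
If what you want is the top n terms for each document rather than across the whole corpus, the same idea works row by row. The following is a minimal sketch, not part of scikit-learn: top_n_per_doc is a hypothetical helper name, and the score matrix is copied from the output above so the snippet runs on its own.

```python
import numpy as np

terms = np.array(['car', 'driven', 'highway', 'road', 'truck'])
scores = np.array([[0.6316672, 0.44943642, 0.0, 0.6316672, 0.0],
                   [0.0, 0.44943642, 0.6316672, 0.0, 0.6316672]])

# argsort orders each row independently, lowest score first,
# so the first three indices of row 0 are its LOWEST-scoring terms.
row_order = np.argsort(scores, axis=1, kind='stable')
lowest_doc1 = terms[row_order[0, :3]]

def top_n_per_doc(scores, terms, n):
    """Hypothetical helper: the n highest-scoring terms of every row."""
    ascending = np.argsort(scores, axis=1, kind='stable')
    descending = ascending[:, ::-1]        # reverse each row: highest score first
    return terms[descending[:, :n]]        # shape (n_docs, n)

print(lowest_doc1)
print(top_n_per_doc(scores, terms, 2))
```

Here lowest_doc1 contains highway, truck and driven, which is exactly what the flattened argsort in the question printed. With n = 2 the helper returns car and road for the first document and highway and truck for the second; the order among tied scores is arbitrary.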