python pandas scikit-learn countvectorizer

Why is this CountVectorizer output different from my word counts?

I have a dataframe with a column called 'Phrase'. I used the following code to find the 20 most common words in this column:

print(pd.Series(' '.join(film['Phrase']).lower().split()).value_counts()[:20])

This gave me the following output:

s             16981
film           6689
movie          5905
nt             3970
one            3609
like           3071
story          2520
rrb            2438
lrb            2098
good           2043
characters     1882
much           1862
time           1747
comedy         1721
even           1597
little         1575
funny          1522
way            1511
life           1484
make           1396

I later needed to create vector counts for each word. I tried to do so using the following code:

vectorizer = CountVectorizer()
vectorizer.fit(film['Phrase'])
print(vectorizer.vocabulary_)

I won't show the whole output, but the numbers are different from the output above. For example, for the word 'movie' it is 9308, for 'good' it is 6131, and for 'make' it is 8655. Why is this happening? Is the value_counts method just counting every row that uses the word rather than counting every occurrence of the word? Have I misunderstood what the CountVectorizer object is doing?

Solution

vectorizer.vocabulary_ does not return word frequencies. According to the documentation, it is:

A mapping of terms to feature indices

What this means is that each of the words in your data gets mapped to an index, which is stored in vectorizer.vocabulary_.

Here is an example to illustrate what is happening:

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

df = pd.DataFrame({"a":["we love music","we love piano"]})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df['a'])
print(vectorizer.vocabulary_)

>>> {'we': 3, 'love': 0, 'music': 1, 'piano': 2}

This vectorization identifies 4 distinct words in the data and assigns each one an index from 0 to 3, in alphabetical order. Now, you might ask: "But why do I even care about these indices?" Because once the vectorization is done, the indices are what tell you which column of the vectorized object belongs to which word. For instance,

X.toarray()
>>> array([[1, 1, 0, 1],
           [1, 0, 1, 1]], dtype=int64)

Using the vocabulary dictionary, you can hence tell that the first column corresponds to "love", the second to "music", the third to "piano" and the fourth to "we".

Note that this is also the order of the words returned by vectorizer.get_feature_names_out() (called get_feature_names() in scikit-learn versions before 1.0):

vectorizer.get_feature_names_out()
>>> array(['love', 'music', 'piano', 'we'], dtype=object)
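
If what you actually want is the total number of occurrences of each word, one option is to sum the columns of the vectorized matrix and label them using the vocabulary. The following is only a minimal sketch built on the toy dataframe above; the names words, totals and counts are made up for illustration, and for your data you would pass film['Phrase'] instead of df['a'].

```python
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": ["we love music", "we love piano"]})

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["a"])  # sparse document-term matrix

# Sort the words by their column index so they line up with the columns of X.
words = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Summing each column gives the total number of occurrences of that word.
totals = np.asarray(X.sum(axis=0)).ravel()

# Label the totals with the words and show the most frequent ones first.
counts = pd.Series(totals, index=words).sort_values(ascending=False)
print(counts.head(20))
```

Do not expect these totals to match the split()-based counts exactly. With its default settings, CountVectorizer only keeps tokens of at least two word characters, so an entry such as 's' from your first output would not appear at all.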