Search code examples
python-3.xscikit-learncountvectorizer

fit_transform error using CountVectorizer


So I have a dataframe X which looks something like this:

X.head()

    0    My wife took me here on my birthday for breakf...
    1    I have no idea why some people give bad review...
    3    Rosie, Dakota, and I LOVE Chaparral Dog Park!!...
    4    General Manager Scott Petello is a good egg!!!...
    6    Drop what you're doing and drive here. After I...
    Name: text, dtype: object

And then,

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(X)

But I get this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-61-8ff79b91e317> in <module>()
----> 1 X = cv.fit_transform(X)

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    790         for doc in raw_documents:
    791             feature_counter = {}
--> 792             for feature in analyze(doc):
    793                 try:
    794                     feature_idx = vocabulary[feature]

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(doc)
    264 
    265             return lambda doc: self._word_ngrams(
--> 266                 tokenize(preprocess(self.decode(doc))), stop_words)
    267 
    268         else:

~/anaconda3/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in <lambda>(x)
    230 
    231         if self.lowercase:
--> 232             return lambda x: strip_accents(x.lower())
    233         else:
    234             return strip_accents

~/anaconda3/lib/python3.6/site-packages/scipy/sparse/base.py in __getattr__(self, attr)
    574             return self.getnnz()
    575         else:
--> 576             raise AttributeError(attr + " not found")
    577 
    578     def transpose(self, axes=None, copy=False):

AttributeError: lower not found

No idea why.


Solution

  • You need to specify the column name of the text data even if the dataframe has single column.

    X_countMatrix = cv.fit_transform(X['text'])
    

    Because a CountVectorizer expects an iterable as input and when you supply a dataframe as an argument, only thing thats iterated is the column names. So even if you did not have any errors, that would be incorrect. Lucky that you got an error and got a chance to correct it.