
CountVectorizer raising error on short words


Would somebody explain why CountVectorizer raises this error when I try to fit_transform any short word? Even if I use stop_words=None, I still get the same error. Here is the code:

from sklearn.feature_extraction.text import CountVectorizer

text = ['don\'t know when I shall return to the continuation of my scientific work. At the moment I can do absolutely nothing with it, and limit myself to the most necessary duty of my lectures; how much happier I would be to be scientifically active, if only I had the necessary mental freshness.']

cv = CountVectorizer(stop_words=None).fit(text)

This works pretty much as expected. But if I then try to fit_transform another text

cv.fit_transform(['q'])

and the error raised is

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-acbd560df1a2> in <module>()
----> 1 cv.fit_transform(['q'])

~/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

~/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    809             vocabulary = dict(vocabulary)
    810             if not vocabulary:
--> 811                 raise ValueError("empty vocabulary; perhaps the documents only"
    812                                  " contain stop words")
    813 

ValueError: empty vocabulary; perhaps the documents only contain stop words

I read a few topics about this error, since it seems to be one CountVectorizer raises fairly often, but everything I found covered the case where the text really does contain only stop words. I can't figure out what the problem is in my case, so I would really appreciate any help!


Solution

  • CountVectorizer(token_pattern='(?u)\\b\\w\\w+\\b') by default tokenizes only words (tokens) containing 2+ characters, so a single-character document like 'q' produces no tokens at all, and hence an empty vocabulary

    You can change this default behavior:

    vect = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
    

    Test:

    In [29]: vect.fit_transform(['q'])
    Out[29]:
    <1x1 sparse matrix of type '<class 'numpy.int64'>'
            with 1 stored elements in Compressed Sparse Row format>
    
    In [30]: vect.get_feature_names()
    Out[30]: ['q']
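To see why 'q' yields an empty vocabulary, the default and relaxed token patterns can be compared directly with Python's re module (a minimal sketch; the sample string is illustrative):

```python
import re

# Default CountVectorizer token_pattern: matches only tokens of 2+ word characters.
default = re.compile(r'(?u)\b\w\w+\b')
# Relaxed pattern from the answer: matches tokens of 1+ word characters.
relaxed = re.compile(r'(?u)\b\w+\b')

text = "q is a single-letter token"
print(default.findall(text))  # ['is', 'single', 'letter', 'token'] -- 'q' and 'a' dropped
print(relaxed.findall(text))  # ['q', 'is', 'a', 'single', 'letter', 'token']
```

Note that in newer scikit-learn versions `get_feature_names_out()` replaces `get_feature_names()` (deprecated in 1.0 and removed in 1.2).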