Tags: python, scikit-learn, countvectorizer

Why does CountVectorizer throw an "Empty Vocabulary error" for a bigram when there are two words?


I have a CountVectorizer:

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')

Implementing that vectorizer:

X = word_vectorizer.fit_transform(group['cleanComments'])

Throws this error:

Traceback (most recent call last):

  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"

ValueError: empty vocabulary; perhaps the documents only contain stop words

This error occurs when the document the n-gram is pulling from is the string "duplicate q". It also happens whenever the document is just ' '.

Why isn't the CountVectorizer picking up q (or any single letter for that matter) as a valid word? Is there any comprehensive place that lists the possible reasons this error would be thrown for the CountVectorizer?

EDIT: I did some more digging into the error itself, and it looks like it relates to the vocabulary. I'm assuming the standard vocabulary doesn't accept single letters as words, but I'm not sure how to get around that issue.
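For reference, the failure can be reproduced with a minimal sketch (the document list here is illustrative; the settings match the vectorizer above):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Same settings as in the question; the single document is the failing string.
word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2, 2), analyzer='word')
try:
    word_vectorizer.fit_transform(["duplicate q"])
    msg = None
except ValueError as e:
    msg = str(e)

print(msg)  # empty vocabulary; perhaps the documents only contain stop words
```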


Solution

  • The error is raised by the _count_vocab() function, a method of the CountVectorizer class. The class takes a token_pattern argument, which defines what counts as a word. The documentation for the token_pattern argument notes:

    The default regexp select tokens of 2 or more alphanumeric characters

    And we can see this explicitly in the default argument to __init__:

    token_pattern=r"(?u)\b\w\w+\b"
    

    If you want to allow single-letter words, just remove the first \w from this pattern and set token_pattern explicitly when instantiating your CountVectorizer:

    CountVectorizer(token_pattern=r"(?u)\b\w+\b", 
                    stop_words=None, ngram_range=(2,2), analyzer='word')