python, scikit-learn, countvectorizer

CountVectorizer tokenizer


I have a dataframe of sentences that I ran CountVectorizer on with a predefined vocabulary. For some of the vocabulary entries the count is 0, even though the sentences contain those entries. The entries that do not work are:

* 1 time
* 1 report
* 7 increase
* not a good fit
* not a great fit
* c level
* not a need

The CountVectorizer is defined as follows:

CountVectorizer(vocabulary=cols, ngram_range=(1, 5))

where cols is the predefined vocabulary.

I'm pretty sure this has to do with the tokenizer definition, but I'm not sure how to change it to do what I need. Any help would be appreciated. Thanks!


Solution

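  • Just found the solution in another post. As expected, the problem was the default tokenization in CountVectorizer: the default token_pattern, r"(?u)\b\w\w+\b", treats punctuation and special characters as separators and only keeps tokens of two or more word characters. Single-character tokens such as "1", "7", "a" and "c" are dropped, so n-grams like "1 time" or "not a good fit" are never produced. All I needed to do was change the token pattern as follows: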

    vectorizer = CountVectorizer(vocabulary=cols, ngram_range=(1, 5), token_pattern=r"(?u)\b\w+\b")
    

    The other post has the full explanation.
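
To see why the vocabulary entries above never matched, here is a minimal sketch that applies the two patterns with Python's re module. It only approximates what CountVectorizer does internally (lowercasing, regex tokenization, then word n-grams); the sentence and the helper names tokenize and ngrams are made up for illustration and are not part of scikit-learn.

```python
import re

# Default CountVectorizer pattern: tokens of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the solution: tokens of one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"


def tokenize(text, pattern):
    """Lowercase the text and extract tokens with the given regex (hypothetical helper)."""
    return re.findall(pattern, text.lower())


def ngrams(tokens, min_n=1, max_n=5):
    """Return the set of space-joined word n-grams (hypothetical helper)."""
    return {
        " ".join(tokens[i:i + n])
        for n in range(min_n, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


# Illustrative sentence, not taken from the original dataframe.
sentence = "Not a good fit, only 1 time at C level"

default_tokens = tokenize(sentence, DEFAULT_PATTERN)
relaxed_tokens = tokenize(sentence, RELAXED_PATTERN)

print(default_tokens)  # single-character tokens "a", "1", "c" are missing
print(relaxed_tokens)  # every word and digit is kept

for entry in ["1 time", "c level", "not a good fit"]:
    print(entry, entry in ngrams(default_tokens), entry in ngrams(relaxed_tokens))
```

With the default pattern the sentence yields "not good fit" rather than "not a good fit", which is why a vocabulary entry containing a one-character word gets a count of 0.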