I have a dataframe of sentences that I ran CountVectorizer on with a predefined vocabulary. For some vocabulary entries the count comes back as 0, even though the sentences contain those entries. The entries that for some reason do not work are:
* 1 time
* 1 report
* 7 increase
* not a good fit
* not a great fit
* c level
* not a need
The CountVectorizer is defined as follows:
CountVectorizer(vocabulary=cols, ngram_range=(1, 5))
where cols is the predefined vocabulary (the dictionary).
I'm pretty sure this has to do with the tokenizer settings, but I'm not sure how to change them to get what I need. Any help would be appreciated. Thanks!
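For anyone who wants to reproduce this, here is a minimal sketch of how the suspicion can be checked by looking at the n-grams the analyzer actually emits. The vocabulary and the sentence below are made up for illustration; they are not my real data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the real vocabulary and one real sentence.
cols = ["1 time", "c level", "not a good fit"]
sentence = "Not a good fit, only 1 time at C level"

vectorizer = CountVectorizer(vocabulary=cols, ngram_range=(1, 5))

# build_analyzer() returns the preprocessing + tokenizing + n-gram function
# the vectorizer uses internally; it can be called without fitting.
analyzer = vectorizer.build_analyzer()
ngrams = analyzer(sentence)

# Vocabulary entries that never show up among the generated n-grams.
missing = [entry for entry in cols if entry not in ngrams]
print(ngrams[:10])
print(missing)
```

If an entry shows up in `missing`, the analyzer never produces it, so its count cannot be anything other than 0.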
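Just found the solution in another post. As suspected, the default tokenization in CountVectorizer was the problem: it treats punctuation and special characters as separators and drops single-character tokens, because the default token_pattern is r"(?u)\b\w\w+\b", which only matches tokens of two or more word characters. Any entry containing a one-character token such as "1", "7", "a" or "c" could therefore never match. All I needed to do was change the token pattern as follows:
vectorizer = CountVectorizer(vocabulary=cols, ngram_range=(1, 5), token_pattern=r"(?u)\b\w+\b")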
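To double-check the fix, here is a small sketch that compares the default pattern with the modified one. The vocabulary and sentence are hypothetical examples, not my actual data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical vocabulary and document, for illustration only.
cols = ["1 time", "c level", "not a good fit"]
docs = ["Not a good fit, only 1 time at C level"]

# Default token_pattern: tokens need at least two word characters.
default_vec = CountVectorizer(vocabulary=cols, ngram_range=(1, 5))

# Modified token_pattern: single-character tokens are kept.
fixed_vec = CountVectorizer(
    vocabulary=cols,
    ngram_range=(1, 5),
    token_pattern=r"(?u)\b\w+\b",
)

# With a fixed vocabulary the columns follow the order of cols.
default_counts = default_vec.fit_transform(docs).toarray()[0].tolist()
fixed_counts = fixed_vec.fit_transform(docs).toarray()[0].tolist()

print(dict(zip(cols, default_counts)))
print(dict(zip(cols, fixed_counts)))
```

With the default pattern every entry here counts 0, and with the modified pattern each one counts 1. Note that punctuation is still treated as a separator either way; only the minimum token length changes.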
You can see the full explanation in the linked post.