CountVectorizer default token pattern defines underscore as a letter
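from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The rain in spain_stays']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))  # the learned tokens, in alphabetical order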
['in', 'rain', 'spain_stays', 'the']
This makes sense since, AFAIK, '\w' is equivalent to [a-zA-Z0-9_]. What I need is:
['in', 'rain', 'spain', 'stays', 'the']
So I tried replacing the '\w' with [a-zA-Z0-9]:
vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b')
X = vectorizer.fit_transform(corpus)
print(sorted(vectorizer.vocabulary_))
but I get
['in', 'rain', 'the']
How can I get what I need? Any ideas are welcome.
There is no word boundary between the n and the _ in spain_stays, as \w also matches an underscore. Both characters are word characters, so \b cannot match between them.
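A small sketch with Python's built-in re module may make this easier to see. The variable names are only illustrative; the second pattern is the one from the question.

```python
import re

text = "spain_stays"

# \w covers letters, digits and the underscore, so "n_" and "_s" are both
# word-char/word-char pairs and \b has nowhere to match inside the string.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 11] -> only the very start and the very end

# The pattern from the question needs \b on both sides of the match,
# so neither "spain" nor "stays" can be matched on its own.
strict = re.findall(r"\b[a-zA-Z0-9][a-zA-Z0-9]+\b", text)
print(strict)  # []
```

This is why the whole token spain_stays disappears from the vocabulary instead of being split in two.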
To match 2 or more word characters without an underscore, allowing only a whitespace boundary or an underscore to the left and right:
(?<![^\s_])[^\W_]{2,}(?![^\s_])
The pattern matches:
(?<![^\s_]) Negative lookbehind, assert a whitespace boundary or an underscore to the left
[^\W_]{2,} Match a word character, excluding the underscore, 2 or more times
(?![^\s_]) Negative lookahead, assert a whitespace boundary or an underscore to the right
See a regex demo.
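As a rough check, here is a sketch that applies the pattern (?<![^\s_])[^\W_]{2,}(?![^\s_]) to the sentence from the question using only the re module. The tokenize helper is hypothetical; it imitates what CountVectorizer does by default (lowercase the text, then run findall with the token pattern) and is not the library's own code.

```python
import re

# Token pattern: 2+ word chars without underscore, bounded on both sides
# by whitespace, an underscore, or the start/end of the string.
TOKEN_PATTERN = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"

def tokenize(doc):
    # Hypothetical helper: lowercase first, then collect all matches.
    return re.findall(TOKEN_PATTERN, doc.lower())

tokens = tokenize("The rain in spain_stays")
vocabulary = sorted(set(tokens))
print(vocabulary)  # ['in', 'rain', 'spain', 'stays', 'the']
```

Passing the same string as token_pattern to CountVectorizer should produce the same vocabulary for this corpus.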
A very broad match could be [^\W_]{2,}, but note that this does not take boundaries into account. It only matches runs of word characters without the underscore.
See the difference in the number of matches in this regex demo.
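To illustrate the difference, this sketch runs both patterns over a made-up sample string (the string and variable names are assumptions chosen for the example, not taken from the demo):

```python
import re

sample = "spain_stays co-op"

# Broad pattern: any run of 2+ word characters that are not underscores.
broad = re.findall(r"[^\W_]{2,}", sample)

# Bounded pattern: the same run, but only when it sits between whitespace,
# underscores, or the start/end of the string.
bounded = re.findall(r"(?<![^\s_])[^\W_]{2,}(?![^\s_])", sample)

print(broad)    # ['spain', 'stays', 'co', 'op']
print(bounded)  # ['spain', 'stays']
```

The broad pattern also picks up the pieces around the hyphen, while the bounded one skips them because a hyphen is neither whitespace nor an underscore.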