python regex scikit-learn countvectorizer

CountVectorizer token_pattern to not catch underscore

CountVectorizer's default token pattern treats the underscore as a letter:

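from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The rain in spain_stays']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(vectorizer.get_feature_names())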

gives:

['in', 'rain', 'spain_stays', 'the']

This makes sense since AFAIK '\w' is equivalent to [a-zA-Z0-9_]. What I need is:

['in', 'rain', 'spain', 'stays', 'the']

So I tried replacing the '\w' with [a-zA-Z0-9]:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())

but I get

['in', 'rain', 'the']

How can I get what I need? Any ideas are welcome.

Solution

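There is no word boundary between the n and the _ in spain_stays, as \w also matches an underscore. That is why \b[a-zA-Z0-9][a-zA-Z0-9]+\b matches neither spain nor stays: \b cannot succeed next to the underscore.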
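The missing boundary can be checked with the standard re module alone. This is a small sketch; the lowercased sample string stands in for what CountVectorizer would see after its default lowercasing, and the variable names are only illustrative.

```python
import re

text = "the rain in spain_stays"

# \b needs a word character on one side and a non-word character (or the
# string edge) on the other. "n" and "_" are both word characters, so
# there is no boundary between them.
boundary_between_n_and_underscore = re.search(r"n\b_", text)
print(boundary_between_n_and_underscore)  # None

# The pattern from the question therefore skips "spain" and "stays".
attempt = re.findall(r"\b[a-zA-Z0-9][a-zA-Z0-9]+\b", text)
print(attempt)  # ['the', 'rain', 'in']
```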

To match 2 or more word characters without an underscore, and allowing only a whitespace boundary or an underscore to the left and right:

(?<![^\s_])[^\W_]{2,}(?![^\s_])

The pattern matches:

(?<![^\s_]) Negative lookbehind: assert a whitespace boundary or an underscore to the left
[^\W_]{2,} Match a word character other than the underscore, 2 or more times
(?![^\s_]) Negative lookahead: assert a whitespace boundary or an underscore to the right

See a regex demo.
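As a quick check, the pattern can be tried on the sentence from the question. The sketch below assumes that CountVectorizer, with default settings, lowercases the text and then collects every match of token_pattern; the tokenize helper is hypothetical and only mimics that behaviour with re.findall.

```python
import re

# Pattern from the answer: 2+ word characters without the underscore,
# delimited by whitespace, the string edges, or an underscore.
TOKEN_PATTERN = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"

def tokenize(doc):
    """Hypothetical stand-in for the vectorizer's tokenizer (assumption)."""
    return re.findall(TOKEN_PATTERN, doc.lower())

tokens = tokenize("The rain in spain_stays")
vocabulary = sorted(set(tokens))
print(vocabulary)  # ['in', 'rain', 'spain', 'stays', 'the']
```

Passing the same string as token_pattern=TOKEN_PATTERN to CountVectorizer should produce the same vocabulary.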

A very broad match could be [^\W_]{2,} but note that this does not take boundaries into account. It only matches word characters without the underscore.

See the difference in the number of matches in this regex demo.
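The difference between the two patterns shows up once punctuation is involved. A minimal comparison, using a made-up sample string:

```python
import re

STRICT = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"  # whitespace/underscore boundaries
BROAD = r"[^\W_]{2,}"                        # no boundary checks at all

sample = "spain_stays, foo-bar x1"  # made-up input for illustration

strict_matches = re.findall(STRICT, sample)
broad_matches = re.findall(BROAD, sample)

# "stays" is followed by a comma and "foo"/"bar" touch a hyphen, so the
# strict pattern rejects them; the broad pattern keeps them.
print(strict_matches)  # ['spain', 'x1']
print(broad_matches)   # ['spain', 'stays', 'foo', 'bar', 'x1']
```

Which one fits depends on the corpus: the strict pattern drops tokens adjacent to punctuation, while the broad one splits on every non-alphanumeric character.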