python regex scikit-learn countvectorizer

CountVectorizer token_pattern to not catch underscore


CountVectorizer's default token pattern treats the underscore as a word character:

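from sklearn.feature_extraction.text import CountVectorizer

corpus = ['The rain in spain_stays']
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w\w+\b')
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())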

gives:

['in', 'rain', 'spain_stays', 'the']

This makes sense since, AFAIK, '\w' is equivalent to [a-zA-Z0-9_]. What I need is:

['in', 'rain', 'spain', 'stays', 'the']

So I tried replacing '\w' with [a-zA-Z0-9]:

vectorizer = CountVectorizer(token_pattern=r'(?u)\b[a-zA-Z0-9][a-zA-Z0-9]+\b')
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out().tolist())

but I get:

['in', 'rain', 'the']

How can I get the output I need? Any ideas are welcome.


Solution

  • There is no word boundary between the n and the _ in spain_stays, because \w also matches an underscore, so \b cannot match at that position.

    To match 2 or more word characters without an underscore, and allowing only a whitespace boundary or an underscore to the left and right:

    (?<![^\s_])[^\W_]{2,}(?![^\s_])
    

    The pattern matches:

    • (?<![^\s_]) Negative lookbehind, assert a whitespace boundary or an underscore to the left
    • [^\W_]{2,} Match a word character other than the underscore, 2 or more times
    • (?![^\s_]) Negative lookahead, assert a whitespace boundary or an underscore to the right
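
As a quick check of the breakdown above, here is a minimal sketch that applies the pattern with Python's re module only. The names TOKEN_PATTERN and tokenize are made up for this illustration; the lowercasing step is an assumption meant to mimic what CountVectorizer does by default before it applies token_pattern.

```python
import re

# Pattern from the answer: tokens of 2+ letters/digits, bounded on each side
# by whitespace, an underscore, or the start/end of the string.
TOKEN_PATTERN = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"


def tokenize(text):
    """Lowercase the text, then return every match of TOKEN_PATTERN."""
    return re.findall(TOKEN_PATTERN, text.lower())


corpus = ["The rain in spain_stays"]
tokens = tokenize(corpus[0])
print(sorted(set(tokens)))  # ['in', 'rain', 'spain', 'stays', 'the']
```

The same pattern string should be usable as the token_pattern argument from the question, i.e. CountVectorizer(token_pattern=r'(?<![^\s_])[^\W_]{2,}(?![^\s_])').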



    A much broader alternative is [^\W_]{2,}, but note that it does not take boundaries into account. It matches any run of 2 or more word characters other than the underscore, even when that run is directly adjacent to punctuation.
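
To make the difference concrete, the sketch below runs both patterns over a made-up sentence that contains punctuation. STRICT, BROAD and the sample text are chosen for this illustration only and are not part of the original answer.

```python
import re

# Strict: boundaries must be whitespace, an underscore, or the string edges.
STRICT = r"(?<![^\s_])[^\W_]{2,}(?![^\s_])"
# Broad: any run of 2+ letters/digits, regardless of what surrounds it.
BROAD = r"[^\W_]{2,}"

text = "the rain, in spain_stays (mostly)"

strict_tokens = re.findall(STRICT, text)
broad_tokens = re.findall(BROAD, text)

print(strict_tokens)  # ['the', 'in', 'spain', 'stays']
print(broad_tokens)   # ['the', 'rain', 'in', 'spain', 'stays', 'mostly']
```

The strict pattern drops "rain" and "mostly" because a comma or a parenthesis touches them, so the broad pattern is probably the better fit when the text contains ordinary punctuation.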
