python regex scikit-learn countvectorizer

Sklearn CountVectorizer token_pattern that allows strings with any characters


I want to write a token_pattern= for CountVectorizer that allows any string to be passed in as a token.

The default pattern, (?u)\b\w\w+\b, only keeps runs of two or more word characters, so it excludes a lot, including strings with hyphens.

The closest I have is:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")

From this post.

But my regex skills are lacking, so I haven't been able to customize it successfully.


Solution

  • I figured it out.

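    This allows any string. The pattern .* matches any run of characters other than newlines, so each input string is kept whole as a single token.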

    vectorizer = CountVectorizer(token_pattern=r'.*')
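To see what a candidate pattern would actually extract, it can help to try it with the standard re module first. This is a minimal sketch, assuming CountVectorizer applies token_pattern with re.findall semantics. The sample sentence, the preview helper, and the \S+ alternative are illustrative and not part of the original answer.

```python
import re

# Assumption: CountVectorizer tokenizes with re.findall semantics,
# so plain `re` is enough to preview what each pattern would extract.
DEFAULT = r"(?u)\b\w\w+\b"  # scikit-learn's documented default pattern
WHOLE_DOC = r".*"           # the pattern from the answer above
NON_SPACE = r"\S+"          # hypothetical alternative: runs of non-whitespace


def preview(pattern, text):
    """Return the tokens that `pattern` extracts from `text`."""
    return re.findall(pattern, text)


# Illustrative sample, not from the original question.
sample = "state-of-the-art C++ isn't easy!"

default_tokens = preview(DEFAULT, sample)      # splits on hyphens, drops "C++"
whole_tokens = preview(WHOLE_DOC, sample)      # whole string plus an empty match
non_space_tokens = preview(NON_SPACE, sample)  # whitespace-separated tokens

for name, tokens in [
    ("default", default_tokens),
    ("whole document", whole_tokens),
    ("non-whitespace", non_space_tokens),
]:
    print(f"{name}: {tokens}")
```

With .* the whole string comes back as one token, followed by an empty match at the end of the string. That suits inputs where each document is already a single token. If the goal is instead to split on whitespace while keeping hyphens and punctuation inside tokens, a pattern like \S+ may be a closer fit.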