Sklearn CountVectorizer token_pattern that allows strings with any characters

I want to write a token_pattern= for CountVectorizer (docs) that allows any string to be passed in as a token.

The default pattern, (?u)\b\w\w+\b, only keeps runs of two or more word characters, so it excludes a lot, including strings with hyphens.

The closest I have is:

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")

But my skill in regex is lacking so I've been unable to successfully customize it.
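
To see what each pattern keeps, it can help to run it through re.findall directly, which is (as far as I can tell) how CountVectorizer applies token_pattern after lowercasing. This is a minimal sketch; the sample string is made up for illustration.

```python
import re

# Hypothetical sample text, chosen to include hyphens, an apostrophe and "!".
sample = "state-of-the-art isn't easy!"

# The scikit-learn default, and the attempt from the question.
default_pattern = r"(?u)\b\w\w+\b"
attempt_pattern = r"(?u)\b\w\w+\b|!|\?|\"|\'"

default_tokens = re.findall(default_pattern, sample)
attempt_tokens = re.findall(attempt_pattern, sample)

print(default_tokens)  # ['state', 'of', 'the', 'art', 'isn', 'easy']
print(attempt_tokens)  # ['state', 'of', 'the', 'art', 'isn', "'", 'easy', '!']
```

Both patterns still break the hyphenated word apart and drop the lone "t", because \b\w\w+\b needs at least two word characters in a row.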

Solution

I figured it out.

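This allows any string. The pattern .* matches any run of characters except a newline, so each line passed in is kept whole as a single token.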

vectorizer = CountVectorizer(token_pattern=r'.*')
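
A quick check with plain re.findall shows what this pattern returns. It is only a sketch with a made-up sample string, and it assumes CountVectorizer applies the pattern via findall. The \S+ pattern is not part of the answer above; it is an alternative I am adding on the assumption that tokens are separated by whitespace.

```python
import re

# Hypothetical sample document.
doc = "state-of-the-art isn't easy!"

# The pattern from the answer: the whole line becomes one token,
# and findall also reports a trailing empty match.
any_tokens = re.findall(r".*", doc)

# Assumed alternative: keep every whitespace-delimited chunk as a token.
non_space_tokens = re.findall(r"\S+", doc)

print(any_tokens)        # ["state-of-the-art isn't easy!", '']
print(non_space_tokens)  # ['state-of-the-art', "isn't", 'easy!']
```

So .* fits when each input string is already a single token. If a document holds several tokens separated by spaces, something like \S+ (or passing a custom tokenizer= callable) may be closer to what is wanted.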