I want to write a token_pattern for CountVectorizer (docs) that allows any string to be passed in as a token.
The default pattern excludes a lot: it only keeps runs of two or more word characters, so strings with hyphens get split apart and punctuation is dropped.
The closest I have is this, taken from another post:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w\w+\b|!|\?|\"|\'")
But my skill in regex is lacking so I've been unable to successfully customize it.
I figured it out.
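This allows any string. Note that .* does not split on whitespace, so each whole line of a document is kept as a single token: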
vectorizer = CountVectorizer(token_pattern=r'.*')
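
A quick way to see what a given token_pattern keeps is to run it through re.findall, which, as far as I can tell, is what CountVectorizer's default tokenizer does with the compiled pattern. The sketch below uses only the standard re module. The tokenize helper and the sample sentence are made up for illustration, and \S+ is an alternative I'm suggesting for whitespace-delimited tokens, not part of the original solution.

```python
import re


def tokenize(pattern, text):
    """Hypothetical helper: apply a token_pattern with re.findall."""
    return re.compile(pattern).findall(text)


# Made-up sample document containing hyphens, an apostrophe and punctuation.
doc = "well-known e-mail isn't split?"

# Default CountVectorizer pattern: runs of two or more word characters.
default_tokens = tokenize(r"(?u)\b\w\w+\b", doc)

# The pattern from the answer: the first match is the entire line.
whole_line_tokens = tokenize(r".*", doc)

# Assumed alternative: every whitespace-delimited string is one token.
whitespace_tokens = tokenize(r"\S+", doc)

print(default_tokens)
print(whole_line_tokens)
print(whitespace_tokens)
```

If whitespace-delimited tokens are what you are after, the same pattern can be passed straight to the vectorizer as CountVectorizer(token_pattern=r"\S+").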