python · scikit-learn · tokenize · hashtag · countvectorizer

How to preserve #hashtag and @mention characters with CountVectorizer's token_pattern


I use the sklearn library to extract word counts from tweets, but I have a problem: some special characters get dropped. I want the CountVectorizer object to preserve the '#' and '@' characters.

The default token_pattern parameter is token_pattern=r'(?u)\b\w\w+\b'.

For example on this corpus...

['@terör @terör #terör ak @terör ali ali ...']

...the output is:

['ak', 'ali', 'terör', ...]

CountVectorizer's default regex drops these special characters. How can I preserve them?
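
For reference, a minimal sketch of that default behaviour, using just the visible tokens from the example above (the trailing '...' is omitted):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['@terör @terör #terör ak @terör ali ali']

    vectorizer = CountVectorizer()  # default token_pattern=r'(?u)\b\w\w+\b'
    vectorizer.fit(corpus)

    # '@' and '#' are not word characters, so all three variants
    # collapse to the bare token 'terör'
    print(sorted(vectorizer.vocabulary_))  # ['ak', 'ali', 'terör']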


Solution

  • I changed the parameter to:

    token_pattern=r'\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+'
    

    The output comes out as desired:

    ['@terör', '#terör', ...]
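
    A minimal sketch of the whole thing, assuming the same one-tweet corpus as in the question (the trailing '...' is omitted):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['@terör @terör #terör ak @terör ali ali']

    # Ordinary words, plus @mentions and #hashtags; the lookbehinds
    # keep '@'/'#' only when they start a token.
    vectorizer = CountVectorizer(
        token_pattern=r'\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+'
    )
    counts = vectorizer.fit_transform(corpus)

    print(sorted(vectorizer.vocabulary_))
    # ['#terör', '@terör', 'ak', 'ali'] for this corpus

    Note that a bare 'terör' no longer appears in the vocabulary, because every occurrence in this corpus carries a '@' or '#' prefix and is now kept as a distinct token.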