I use the scikit-learn library to extract word counts from tweets, but it drops some special characters. I want CountVectorizer to preserve the '#' and '@' characters.
The default token_pattern parameter is: token_pattern=r'(?u)\b\w\w+\b'
For example, on this corpus...
['@terör @terör #terör ak @terör ali ali ...']
...the output is:
['ak', 'ali', 'terör', ...]
CountVectorizer's default regex strips these special characters. How can I preserve them?
I changed the parameter to:
token_pattern=r'\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+'
The output comes out as desired:
['@terör', '#terör', ...]