python · scikit-learn · tokenize · hashtag · countvectorizer

How to preserve #hashtag and @mention characters with CountVectorizer's token_pattern


I use the sklearn library to extract word counts from tweets, but I have a problem: some special characters get dropped. I want the CountVectorizer object to preserve the '#' and '@' characters.

The default token_pattern parameter is token_pattern=r'(?u)\b\w\w+\b'.

For example on this corpus...

['@terör @terör #terör ak @terör ali ali ...']

...the output is:

['ak', 'ali', 'terör', ...]

CountVectorizer's default regex drops these special characters. How can I preserve them?
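
For reference, a minimal sketch of that default behaviour, using just the visible tokens from the example above (the trailing '...' is omitted):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['@terör @terör #terör ak @terör ali ali']

    vectorizer = CountVectorizer()  # default token_pattern=r'(?u)\b\w\w+\b'
    vectorizer.fit(corpus)

    # '@' and '#' are not word characters, so all three variants
    # collapse to the bare token 'terör'
    print(sorted(vectorizer.vocabulary_))  # ['ak', 'ali', 'terör']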


Solution

  • I changed the parameter to:

    token_pattern=r'\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+'
    

    The output comes out as desired:

    ['@terör', '#terör', ...]
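
    A minimal sketch of the whole thing, assuming the same one-tweet corpus as in the question (the trailing '...' is omitted):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = ['@terör @terör #terör ak @terör ali ali']

    # Ordinary words, plus @mentions and #hashtags; the lookbehinds
    # keep '@'/'#' only when they start a token.
    vectorizer = CountVectorizer(
        token_pattern=r'\b\w\w+\b|(?<!\w)@\w+|(?<!\w)#\w+'
    )
    counts = vectorizer.fit_transform(corpus)

    print(sorted(vectorizer.vocabulary_))
    # ['#terör', '@terör', 'ak', 'ali'] for this corpus

    Note that a bare 'terör' no longer appears in the vocabulary, because every occurrence in this corpus carries a '@' or '#' prefix and is now kept as a distinct token.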