Tags: python, scikit-learn, nlp, countvectorizer

scikit-learn CountVectorizer: keeping emojis as words


I am using scikit-learn's CountVectorizer on strings, but it discards all the emojis in the text.

For instance, 👋 Welcome should give us: ["\xf0\x9f\x91\x8b", "welcome"]

However, when running:

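from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
vect.fit_transform(['👋 Welcome'])
vect.get_feature_names_out()  # learned vocabulary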

I only get: ["welcome"]

This has to do with the token_pattern, which does not count the encoded emoji as a word. Is there a custom token_pattern that handles emojis?
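
The effect can be reproduced outside scikit-learn with the standard `re` module. The sketch below assumes the documented default token_pattern, `(?u)\b\w\w+\b`, and lowercases the text first because CountVectorizer does so by default; the variable names are only illustrative.

```python
import re

# Default token_pattern used by CountVectorizer (per the scikit-learn docs):
# words of two or more "word" characters. An emoji is not a word character.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative: treat every run of non-whitespace characters as a token.
NON_SPACE_PATTERN = r"[^\s]+"

text = "👋 Welcome".lower()  # CountVectorizer lowercases by default

default_tokens = re.findall(DEFAULT_PATTERN, text)
non_space_tokens = re.findall(NON_SPACE_PATTERN, text)

print(default_tokens)    # ['welcome']
print(non_space_tokens)  # ['👋', 'welcome']
```

The default pattern also drops single-character tokens, so even a one-letter word would be discarded, not just the emoji.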


Solution

  • Yes, you are right: token_pattern has to be changed. Instead of matching only alphanumeric characters, it can match any run of non-whitespace characters.

    Try this!

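    from sklearn.feature_extraction.text import TfidfVectorizer

    s = ['👋 Welcome', '👋 Welcome']

    # Treat every run of non-whitespace characters as a token, so the emoji is kept.
    # CountVectorizer accepts the same token_pattern argument.
    v = TfidfVectorizer(token_pattern=r'[^\s]+')
    v.fit(s)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
    list(v.get_feature_names_out())

    # ['welcome', '👋']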