scikit-learn CountVectorizer: keeping emojis as words

I am using scikit-learn's CountVectorizer on strings, but it discards all the emojis in the text.

For instance, 👋 Welcome should give us: ["\xf0\x9f\x91\x8b", "welcome"]

However, when running:

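from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
vect.fit_transform(['👋 Welcome'])  # learn the vocabulary and build the count matrix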

I only get: ["welcome"]

This has to do with the token_pattern, which does not count the encoded emoji as a word. Is there a custom token_pattern that deals with emojis?

Solution

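Yes, you are right: token_pattern has to be changed. The default pattern, r"(?u)\b\w\w+\b", only matches runs of two or more word (alphanumeric or underscore) characters, and an emoji is not a word character. Instead, we can match any run of characters other than whitespace. Note that with such a pattern, punctuation stays attached to the neighbouring word.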
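To see why the pattern matters, here is a small sketch using only Python's re module. It assumes Python 3 (Unicode strings) and mimics the vectorizer's default behaviour of lowercasing the text and then calling findall with the token pattern; the sample sentence with punctuation is made up for illustration.

```python
import re

# Default token_pattern of CountVectorizer / TfidfVectorizer.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Alternative pattern: any run of non-whitespace characters.
NON_SPACE_PATTERN = r"[^\s]+"

text = "👋 Welcome".lower()  # the vectorizers lowercase by default

default_tokens = re.findall(DEFAULT_PATTERN, text)
custom_tokens = re.findall(NON_SPACE_PATTERN, text)
print(default_tokens)  # ['welcome']
print(custom_tokens)   # ['👋', 'welcome']

# Trade-off: punctuation is no longer stripped from tokens.
punct_tokens = re.findall(NON_SPACE_PATTERN, "hi, there!")
print(punct_tokens)    # ['hi,', 'there!']
```

If attached punctuation is a problem for your data, you may need a stricter pattern or a custom tokenizer.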

Try this!

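from sklearn.feature_extraction.text import TfidfVectorizer

s = ['👋 Welcome', '👋 Welcome']

# Treat every run of non-whitespace characters as a token
v = TfidfVectorizer(token_pattern=r'[^\s]+')
v.fit(s)
# On scikit-learn versions before 1.0, use v.get_feature_names() instead
list(v.get_feature_names_out())

# ['welcome', '👋']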
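
Since the question was about CountVectorizer, here is a minimal sketch of the same token_pattern applied there; both vectorizers accept this argument. The second sample document is made up for illustration, and the vocabulary is read from the vocabulary_ attribute so the snippet does not depend on which feature-name method your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative documents; the second one is invented for this sketch.
docs = ['👋 Welcome', '👋 👋 hello']

# Treat every run of non-whitespace characters as a token.
vect = CountVectorizer(token_pattern=r'[^\s]+')
counts = vect.fit_transform(docs)  # sparse document-term matrix

# vocabulary_ maps each token to its column index in the matrix.
features = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(features)          # ['hello', 'welcome', '👋']
print(counts.toarray())  # [[0 1 1]
                         #  [1 0 2]]
```

Each row is a document and each column a token, so the emoji is counted twice in the second document.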