I am using scikit-learn's CountVectorizer
on strings, but CountVectorizer
discards all the emojis in the text.
For instance, '👋 Welcome'
should give us: ['👋', 'welcome']
However, when running:
vect = CountVectorizer()
vect.fit_transform(['👋 Welcome'])
I only get: ['welcome']
This has to do with the default token_pattern,
which does not treat the emoji as a word token. Is there a custom token_pattern
that handles emojis?
Yes, you are right: token_pattern
has to be changed. Instead of matching only alphanumeric characters, we can match any run of characters other than whitespace.
Try this:
from sklearn.feature_extraction.text import TfidfVectorizer

s = ['👋 Welcome', '👋 Welcome']
v = TfidfVectorizer(token_pattern=r'[^\s]+')
v.fit(s)
v.get_feature_names_out()
# array(['welcome', '👋'], dtype=object)