I have some text that includes the phrase "1:1". How do I get CountVectorizer
to recognize that as a token?
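from sklearn.feature_extraction.text import CountVectorizer

text = ["first ques # 1:1 on stackoverflow", "please help"]
vec = CountVectorizer()
vec.fit_transform(text)
vec.get_feature_names()  # on scikit-learn 1.0+, use vec.get_feature_names_out()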
You could use a custom tokenizer. For simple cases, replacing
vec = CountVectorizer()
with
vec = CountVectorizer(tokenizer=lambda s: s.split())
is enough. With this modification your code returns:
[u'#', u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']
Hopefully this puts you on the right track, but note that this workaround will not work properly in more complex cases, for example if your text contains punctuation marks: splitting on whitespace leaves the punctuation attached to the neighbouring tokens.
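To make that limitation concrete, here is a minimal sketch that needs only plain Python. The sample sentence is made up for illustration; it runs the same whitespace split that the lambda tokenizer performs.

```python
# Whitespace splitting keeps punctuation glued to the neighbouring word.
sample = "first ques... # 1:1, on stackoverflow"  # illustrative input
tokens = sample.split()
print(tokens)
# ['first', 'ques...', '#', '1:1,', 'on', 'stackoverflow']
```

The token you wanted, "1:1", now shows up as "1:1," with a trailing comma, so it would be counted as a different feature.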
To deal with punctuation marks, you could pass CountVectorizer
a token pattern like this:
text = [u"first ques... # 1:1, on stackoverflow", u"please, help!"]
vec = CountVectorizer(token_pattern=r'\w:?\w+')
Output:
[u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']
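If you want to check a pattern before handing it to CountVectorizer, you can try it with the re module directly. The sketch below assumes the vectorizer applies token_pattern with findall, which is how I understand the default tokenizer to work. The "10:30" sentence and general_pattern are my own illustrations, not part of the answer above.

```python
import re

text = "first ques... # 1:1, on stackoverflow"

# scikit-learn's documented default token_pattern: words of 2+ characters.
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# Pattern from the answer: one word character, an optional colon, then more.
answer_pattern = re.compile(r"\w:?\w+")
# Assumed generalisation (not from the answer): word runs joined by colons.
general_pattern = re.compile(r"\w+(?::\w+)*")

print(default_pattern.findall(text))
# ['first', 'ques', 'on', 'stackoverflow']
print(answer_pattern.findall(text))
# ['first', 'ques', '1:1', 'on', 'stackoverflow']
print(answer_pattern.findall("meet at 10:30"))
# ['meet', 'at', '10', '30']
print(general_pattern.findall("meet at 10:30"))
# ['meet', 'at', '10:30']
```

Note that the answer's pattern only allows a single character before the colon, so "10:30" is still split in two. Something like general_pattern keeps it together, but it also accepts one-character tokens, which the default pattern drops.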