Get CountVectorizer to include "1:1"

My text includes the phrase "1:1". How do I get CountVectorizer to recognize it as a token?

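from sklearn.feature_extraction.text import CountVectorizer

text = ["first ques # 1:1 on stackoverflow", "please help"]
vec = CountVectorizer()
vec.fit_transform(text)

vec.get_feature_names()  # get_feature_names_out() in scikit-learn 1.0 and later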

Solution

You could use a custom tokenizer. For simple cases, replacing

vec = CountVectorizer()

with

vec = CountVectorizer(tokenizer=lambda s: s.split())

would do. With this modification your code returns:

[u'#', u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']

Hopefully this suggestion will put you on the right track, but note that this workaround will not work properly in more complex cases. For example, if your text contains punctuation marks, split() leaves them attached to the neighboring words.
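To see the problem concretely, here is a minimal plain-Python sketch of what a whitespace tokenizer does to punctuated text. It does not call scikit-learn. The helper name whitespace_tokens is made up for illustration, and the lowercasing step is there only to mirror CountVectorizer's default lowercase=True.

```python
# Punctuated variant of the example sentences.
docs = ["first ques... # 1:1, on stackoverflow", "please, help!"]

def whitespace_tokens(doc):
    # Same splitting rule as tokenizer=lambda s: s.split();
    # lower() mirrors CountVectorizer's default lowercase=True.
    return doc.lower().split()

# Collect the distinct tokens, sorted the way a vocabulary listing would be.
vocabulary = sorted({tok for doc in docs for tok in whitespace_tokens(doc)})
print(vocabulary)
```

The vocabulary now contains '1:1,' and 'help!' rather than '1:1' and 'help', so the same word with and without trailing punctuation would be counted as two different features.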

To deal with punctuation marks, you could pass CountVectorizer a token pattern like this:

text = [u"first ques... # 1:1, on stackoverflow", u"please, help!"]
vec = CountVectorizer(token_pattern=r'\w:?\w+')  # raw string avoids invalid-escape warnings
vec.fit_transform(text)
vec.get_feature_names()

Output:

[u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']
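One caveat worth checking: in the pattern above the colon may only follow the first character of a token, so something like "10:15" would be split in two. The sketch below tests this with the standard re module, on the assumption that CountVectorizer applies token_pattern with a findall-style scan over the lowercased text. The sample sentence and the alternative pattern wider_pattern are my own illustrations, not part of the original answer.

```python
import re

# Lowercased sample text; "10:15" is a made-up case not in the original post.
doc = "meeting at 10:15, ratio 1:1".lower()

# Pattern from the answer: the colon may only follow the first character.
answer_pattern = re.compile(r"\w:?\w+")

# Hypothetical alternative: either word:word, or a word of 2+ characters.
wider_pattern = re.compile(r"\w+:\w+|\w\w+")

answer_tokens = answer_pattern.findall(doc)
wider_tokens = wider_pattern.findall(doc)

print(answer_tokens)  # ['meeting', 'at', '10', '15', 'ratio', '1:1']
print(wider_tokens)   # ['meeting', 'at', '10:15', 'ratio', '1:1']
```

If the alternative suits your data, it could be passed as token_pattern in the same way. Like the default pattern, both variants drop single-character tokens such as '#'.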