python, scikit-learn, countvectorizer

Get CountVectorizer to include "1:1"


I have some text that includes the phrase "1:1". How do I get CountVectorizer to recognize that as a token?

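from sklearn.feature_extraction.text import CountVectorizer

text = ["first ques # 1:1 on stackoverflow", "please help"]
vec = CountVectorizer()
vec.fit_transform(text)

vec.get_feature_names()  # on scikit-learn 1.2+, use vec.get_feature_names_out()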

Solution

  • You could use a customized tokenizer. For simple cases, replacing

    vec = CountVectorizer()
    

    by

    vec = CountVectorizer(tokenizer=lambda s: s.split())
    

    would do. With this modification your code returns:

    [u'#', u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']
    

    Hopefully this suggestion puts you on the right track, but note that this workaround does not work properly in more complex cases (for example, if your text contains punctuation marks, which stay attached to the neighboring words).

    To deal with punctuation marks, you could pass CountVectorizer a token pattern like this:

    text = [u"first ques... # 1:1, on stackoverflow", u"please, help!"]
    vec = CountVectorizer(token_pattern=r'\w:?\w+')
    vec.fit_transform(text)
    vec.get_feature_names()
    

    Output:

    [u'1:1', u'first', u'help', u'on', u'please', u'ques', u'stackoverflow']
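
To see why the default settings drop "1:1", and how the token pattern above behaves on longer values, here is a minimal sketch. It assumes a recent scikit-learn; the third sample sentence and the broader pattern are illustrative assumptions, not part of the original answer. The default pattern (?u)\b\w\w+\b only keeps runs of two or more word characters, so "1:1" is split at the colon and both single-character pieces are discarded. The pattern \w:?\w+ allows only one character before the colon, so a value such as "10:15" would come out as "10" and "15".

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# The third sentence is a made-up sample added to exercise a longer "a:b" value.
text = ["first ques... # 1:1, on stackoverflow", "please, help!", "meet at 10:15"]

# Default analyzer: lowercases, then keeps tokens of 2+ word characters.
default_tokens = CountVectorizer().build_analyzer()(text[0])
# → ['first', 'ques', 'on', 'stackoverflow']

# The pattern from the answer only permits one character before the colon.
narrow_tokens = re.findall(r"\w:?\w+", text[2])
# → ['meet', 'at', '10', '15']

# Hypothetical broader pattern (an assumption for illustration):
# either colon-joined word runs such as "1:1" or "10:15",
# or ordinary tokens of 2+ word characters.
# The group is non-capturing, so the whole match becomes the token.
pattern = r"(?u)\b\w+(?::\w+)+\b|\b\w\w+\b"

vec = CountVectorizer(token_pattern=pattern)
vec.fit(text)

# vocabulary_ maps each token to its column index; sort the keys for display.
vocab = sorted(vec.vocabulary_)
# → ['10:15', '1:1', 'at', 'first', 'help', 'meet', 'on', 'please', 'ques', 'stackoverflow']
```

If single-character tokens also matter, the second alternative could be relaxed to \b\w+\b.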