python | scikit-learn | countvectorizer

Adding numbers to stop_words to scikit-learn's CountVectorizer


This question explains how to add your own words to the built-in English stop words of CountVectorizer. I'm interested in seeing the effects on a classifier of eliminating any numbers as tokens.

ENGLISH_STOP_WORDS is stored as a frozenset, so I guess my question boils down (unless there's a method I don't know about) to whether it's possible to add an arbitrary number representation to a frozen set?

My feeling is that it's not possible: stop_words has to be a finite list, and there are infinitely many ways to write a number.

I suppose one way to accomplish the same thing would be to loop through the corpus and collect every word where word.isdigit() is true into a set/list that I can then union with ENGLISH_STOP_WORDS (see previous answer), but I'd rather be lazy and pass something simpler to the stop_words parameter.
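Something like this is what I mean (the toy corpus below is just for illustration; frozenset.union returns a new set, so ENGLISH_STOP_WORDS itself is never mutated):

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Toy corpus for illustration.
corpus = ['This is a sentence.', '12 dogs eat candy', '1 2 3 45']

# Collect every purely numeric token that actually occurs in the corpus.
numeric = {word for doc in corpus for word in doc.split() if word.isdigit()}

# frozenset.union returns a new set, leaving ENGLISH_STOP_WORDS untouched.
stop_words = ENGLISH_STOP_WORDS.union(numeric)

cv = CountVectorizer(stop_words=list(stop_words))
cv.fit(corpus)
print(sorted(cv.vocabulary_))  # the numeric tokens are gone
```

The obvious drawback is the extra pass over the corpus, and any number that appears only in unseen data will still slip through.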


Solution

  • Instead of extending the stopword list, you can implement this as a custom preprocessor for the CountVectorizer. Below is a simple version of this shown in bpython.

    >>> import re
    >>> from sklearn.feature_extraction.text import CountVectorizer
    >>> cv = CountVectorizer(preprocessor=lambda x: re.sub(r'\d[\d.]*', 'NUM', x.lower()))
    >>> cv.fit(['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45'])
    CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
            dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
            lowercase=True, max_df=1.0, max_features=None, min_df=1,
            ngram_range=(1, 1),
            preprocessor=<function <lambda> at 0x109bbcb18>, stop_words=None,
            strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
            tokenizer=None, vocabulary=None)
    >>> cv.vocabulary_
    {u'sentence': 6, u'this': 7, u'is': 4, u'candy': 1, u'dogs': 2, u'second': 5, u'NUM': 0, u'eat': 3}
    

    Precompiling the regexp would likely give some speedup over a large number of samples.
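A sketch of the precompiled variant (the pattern here is written as r'\d[\d.]*', a slight generalization so that numbers of any length, including decimals like 3.5, collapse to a single placeholder):

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

# Compile the pattern once instead of re-parsing it for every document.
# r'\d[\d.]*' matches a digit followed by any run of digits/dots, so
# odd-length numbers and decimals also collapse to one placeholder.
NUM_RE = re.compile(r'\d[\d.]*')

def preprocess(text):
    """Lowercase the text, then replace every number with 'NUM'."""
    return NUM_RE.sub('NUM', text.lower())

cv = CountVectorizer(preprocessor=preprocess)
cv.fit(['This is a sentence.', '12 dogs eat candy', '1 2 3 45'])
print(sorted(cv.vocabulary_))  # 'NUM' is a token; no numeric tokens remain
```

Note that passing a custom preprocessor replaces the built-in lowercasing step, which is why the lambda/function has to call .lower() itself; the replacement happens after lowercasing, so the 'NUM' placeholder survives in uppercase.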