This question explains how to add your own words to the built-in English stop words of CountVectorizer. I'm interested in seeing the effect on a classifier of eliminating all numbers as tokens.
ENGLISH_STOP_WORDS is stored as a frozen set, so I guess my question boils down (unless there's a method I don't know of) to whether it's possible to add an arbitrary number representation to a frozen set.
My feeling is that it's not possible, since the stop word list you pass has to be finite and so cannot cover every possible number.
I suppose one way to accomplish the same thing would be to loop through the test corpus and collect every word for which word.isdigit() is true into a set/list that I can then union with ENGLISH_STOP_WORDS (see previous answer), but I'd rather be lazy and pass something simpler to the stop_words parameter.
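For reference, the loop-and-union idea might look roughly like the sketch below. The helper name with_digit_tokens and the tiny BASE_STOP_WORDS set are made up for illustration (the latter stands in for ENGLISH_STOP_WORDS so the snippet runs without scikit-learn). It also shows that although a frozenset cannot be modified in place, union() returns a new frozenset containing the extra items.

```python
import re

# Hypothetical stand-in for sklearn's ENGLISH_STOP_WORDS so this sketch runs
# without scikit-learn; in practice the real frozenset would be passed instead.
BASE_STOP_WORDS = frozenset(["a", "is", "the", "this"])


def with_digit_tokens(corpus, base_stop_words):
    """Return a new frozenset: base_stop_words plus every all-digit token in corpus."""
    digit_tokens = {
        token
        for doc in corpus
        for token in re.findall(r"\w+", doc.lower())
        if token.isdigit()
    }
    # A frozenset cannot be mutated, but union() builds a new one.
    return base_stop_words.union(digit_tokens)


corpus = ["12 dogs eat candy", "1 2 3 45"]
stop_words = with_digit_tokens(corpus, BASE_STOP_WORDS)
print(sorted(stop_words))
```

The result could then be passed as stop_words=list(stop_words). The limitation is the one suspected above: only numbers that actually occur in the corpus you scanned end up in the set, so unseen numbers in new data would still become tokens.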
Instead of extending the stop word list, you can implement this as a custom preprocessor for the CountVectorizer. Below is a simple version, shown in bpython, that replaces every number with a single NUM token.
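>>> import re
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> # Match a digit followed by any run of digits or dots, so that numbers of any length map to NUM
>>> cv = CountVectorizer(preprocessor=lambda x: re.sub(r'\d[\d.]*', 'NUM', x.lower()))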
>>> cv.fit(['This is sentence.', 'This is a second sentence.', '12 dogs eat candy', '1 2 3 45'])
CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1),
preprocessor=<function <lambda> at 0x109bbcb18>, stop_words=None,
strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None)
>>> cv.vocabulary_
{u'sentence': 6, u'this': 7, u'is': 4, u'candy': 1, u'dogs': 2, u'second': 5, u'NUM': 0, u'eat': 3}
Precompiling the regexp would likely give some speedup over a large number of samples.
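A minimal sketch of the precompiled variant is below. The names NUMBER_RE and replace_numbers are made up for illustration, and the pattern (a digit followed by any run of digits or dots) is an assumption about what should count as a number; adjust it to your data.

```python
import re

# Compiled once at import time; matches a digit followed by any run of digits or dots.
NUMBER_RE = re.compile(r"\d[\d.]*")


def replace_numbers(text):
    """Lowercase the text and replace each number with the placeholder NUM."""
    return NUMBER_RE.sub("NUM", text.lower())


print(replace_numbers("12 dogs eat 3.5 candies"))
```

The function can then be passed as CountVectorizer(preprocessor=replace_numbers). Since Python's re module already caches recently compiled patterns, the gain may be modest, but a named function is also picklable where a lambda is not, which helps if you want to save the fitted vectorizer.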