
Remove Numbers and Symbols with Regex on CountVectorizer


Currently, I have a CountVectorizer configured as follows:

CountVectorizer(stop_words=stopwords.words('spanish'),token_pattern=r'(?u)\b\w\w+\b')

It uses scikit-learn's default token_pattern, and get_feature_names returns results such as:

000,02,10,100,1080,11,14,17,19,1994,1ª,2015,2017,22,24horas,2t0s6dgxnm,30,31,32,_aitor,_anamilan_,_cuteresa,_raquel97_

I would like to remove numbers and the _ symbol. I know that to do this I must modify the default regex, r'(?u)\b\w\w+\b'. Any suggestions?

Thanks.

UPDATE:

Good words: abrazo, aburrir, extrañar, además

Bad words: anamilan,000,02,10,100,1080

I would also like to keep ñ, á, é, í, ó, ú. I tried [á-ú_ñ]+, but it doesn't work.


Solution

  • This character class matches any single digit or underscore:

    [\d_]
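
One way to apply this idea to the tokenizer itself is to subtract digits and _ from \w instead of matching them, so that each token consists of letters only. The sketch below is an assumption rather than a tested scikit-learn recipe: the name TOKEN_PATTERN is hypothetical, and the pattern is checked here with Python's re module only. In Python 3, \w already matches accented letters such as ñ and á, so they need no separate character class. The same string could then be tried as token_pattern=TOKEN_PATTERN in CountVectorizer.

```python
import re

# Hypothetical replacement for the default token_pattern.
# [^\W\d_] reads as "a word character that is neither a digit nor an underscore",
# i.e. a letter (including accented ones such as ñ or á in Python 3).
TOKEN_PATTERN = r"(?u)\b[^\W\d_]{2,}\b"

# Sample terms taken from the question: unwanted entries first, wanted words last.
samples = "000 02 1994 1ª 24horas 2t0s6dgxnm _aitor _anamilan_ abrazo aburrir extrañar además"

tokens = re.findall(TOKEN_PATTERN, samples)
print(tokens)  # ['abrazo', 'aburrir', 'extrañar', 'además']
```

Because the \b anchors are kept, a term that merely contains a digit or underscore (24horas, _anamilan_) is dropped entirely rather than trimmed down to its letters, which matches the "bad words" list in the question.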