Currently, I have a CountVectorizer that uses scikit-learn's default token_pattern, and get_feature_names gives me results such as the following:
I would like to remove numbers and the _ symbol. I know that to do this I must modify the default regex: r'(?u)\b\w\w+\b'
Any suggestions?
Good words: abrazo, aburrir, extrañar, además
Bad words: anamilan, 000, 02, 10, 100, 1080
I would also like to keep ñ, á, é, í, ó, ú. I tried [á-ú_ñ]+
but it doesn't work.
This pattern should match all the digits and _.
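One possible approach, sketched here rather than offered as the definitive answer: the class [á-ú_ñ] only covers the code points from á (U+00E1) to ú (U+00FA) plus the underscore, so it matches no plain ASCII letters and cannot capture whole words. A pattern built on [^\W\d_] ("a word character that is neither a digit nor an underscore") keeps accented letters and ñ without listing them. The name TOKEN_PATTERN and the helper tokenize are made up for this illustration, and the sample text is invented from the words in the question.

```python
import re

# Tokens of two or more letters. [^\W\d_] means "a word character that is
# neither a digit nor an underscore"; in Python 3 this includes ñ, á, é, etc.
TOKEN_PATTERN = r"(?u)\b[^\W\d_][^\W\d_]+\b"


def tokenize(text):
    """Hypothetical helper: return the tokens the pattern extracts from text."""
    # Lowercase the text first, mirroring CountVectorizer's default behaviour.
    return re.findall(TOKEN_PATTERN, text.lower())


sample = "abrazo aburrir extrañar además 000 02 10 100 1080 foo_bar"
tokens = tokenize(sample)
print(tokens)

# To use the same pattern in scikit-learn (not executed in this sketch):
# from sklearn.feature_extraction.text import CountVectorizer
# vectorizer = CountVectorizer(token_pattern=TOKEN_PATTERN)
```

Because of the \b anchors, a token that mixes letters with digits or underscores (for example foo_bar or abc123) is dropped entirely rather than trimmed. This filter works on characters only, so a letters-only word such as anamilan would still be kept and would need a stop-word list or similar.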