I do not know if I have encountered a possible bug in the sklearn CountVectorizer or if I am simply misunderstanding something.
I am working with a small corpus of texts which contain a variety of parenthetical strings, only some of which need to be removed. After some experimentation, I decided simply to go with a list of those parentheticals, a subset I am including below:
parentheticals = [r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)"]
Because I have found no way to work around CountVectorizer's requirement that it receive a string or a list of strings, I went with this small regex function:
import re

def clean_parens(text):
    # Replace each listed parenthetical with a space, ignoring case.
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text
I then passed this to CountVectorizer as the preprocessor argument:
from sklearn.feature_extraction.text import CountVectorizer
vec2 = CountVectorizer(preprocessor=clean_parens)
X2 = vec2.fit_transform(texts)
On the first run, I noticed my feature set had grown from 53k to 58k for ~1700 texts. When I inspected the feature names, I saw that I had both uppercase and lowercase variants of the same terms:
print(vec2.get_feature_names())
---
... 'Waves' ... 'waves'
When I included lowercase=True in the CountVectorizer, I got no change in my results. Is this because the preprocessor takes precedence? (This is not how I understood the documentation.)
A simple change to the little regex function above sets everything right:
def clean_parens(text):
    # Lowercase once up front instead of on every loop iteration.
    new_text = text.lower()
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text
I'm happy with this, but if someone could explain what I misunderstood about the CountVectorizer, that would be great. I feel like it's a cabinet saw, and I'm used to using a handheld circular saw: its power is somewhere between might and magic in the hands of someone like me.
Great catch!
I wouldn't call this an actual bug, but it is a gap in the documentation. Perhaps an error or warning should be raised when preprocessor is a callable and lowercase=True.
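A check of that kind might look something like the sketch below. The function name `warn_if_lowercase_ignored` is hypothetical and is not part of scikit-learn; it only illustrates the idea of flagging the conflicting combination.

```python
import warnings


def warn_if_lowercase_ignored(preprocessor, lowercase):
    """Hypothetical check, not scikit-learn code.

    Warn when a callable preprocessor is supplied together with
    lowercase=True, since the custom callable replaces the step
    that would have lowercased the text. Returns True if it warned.
    """
    if callable(preprocessor) and lowercase:
        warnings.warn(
            "lowercase=True has no effect when a callable preprocessor is given",
            UserWarning,
        )
        return True
    return False
```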
FYI, lowercasing happens in the default preprocessor function here. Hence, when you override the preprocessor with a callable, the lowercasing does not happen.
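To make that concrete, here is a simplified model of the selection logic in plain Python. It is a sketch, not scikit-learn's actual source: `build_preprocessor` below is a stand-in written for illustration, and accent stripping is left out entirely.

```python
import re

# Raw strings so the backslashes reach the regex engine unchanged.
parentheticals = [r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)"]


def clean_parens(text):
    """Custom preprocessor from the question: strips parentheticals only."""
    for pattern in parentheticals:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return text


def build_preprocessor(preprocessor=None, lowercase=True):
    """Simplified stand-in for how the vectorizer picks its preprocessor."""
    if preprocessor is not None:
        # A user-supplied callable is used as-is; lowercase is never consulted.
        return preprocessor
    if lowercase:
        return str.lower
    return lambda doc: doc


default = build_preprocessor(lowercase=True)
custom = build_preprocessor(preprocessor=clean_parens, lowercase=True)

print(default("Waves (Laughter)"))  # lowercased by the default path
print(custom("Waves (Laughter)"))   # parenthetical removed, case untouched
```

In this model the `lowercase` flag only matters on the default path, which matches the behavior seen in the question: 'Waves' and 'waves' end up as separate features once a custom callable is supplied.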
I have raised this issue here.
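In the meantime, one way to keep a custom preprocessor and still get lowercased features is to wrap it. The helper name `lowercased` below is made up for this sketch, and `strip_marker` is only a toy preprocessor used to show the effect.

```python
def lowercased(preprocessor):
    """Return a preprocessor that lowercases, then applies `preprocessor`."""
    def wrapped(text):
        return preprocessor(text.lower())
    return wrapped


def strip_marker(text):
    # Toy preprocessor for the demo: removes a literal "(music)" marker.
    return text.replace("(music)", " ")


prep = lowercased(strip_marker)
print(prep("Waves (Music) crash"))
```

The wrapped callable could then be passed in the same way as before, for example `CountVectorizer(preprocessor=lowercased(clean_parens))`, so the cleaning function itself does not need to know about case.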