Bug in sklearn CountVectorizer with preprocessor and lowercase?


I do not know if I have encountered a possible bug in the sklearn CountVectorizer or if I am simply misunderstanding something.

I am working with a small corpus of texts which contain a variety of parenthetical strings, only some of which need to be removed. After some experimentation, I decided simply to go with a list of those parentheticals, a subset I am including below:

parentheticals = [r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)"]

Because I have found no way to work around CountVectorizer's requirement that it receive a string or a list of strings, I went with this small regex function:

import re

def clean_parens(text):
    # Replace each listed parenthetical with a space, ignoring case
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text, flags=re.IGNORECASE)
    return new_text

I then passed this to CountVectorizer as a preprocessor argument:

from sklearn.feature_extraction.text import CountVectorizer

vec2 = CountVectorizer(preprocessor=clean_parens)
X2 = vec2.fit_transform(texts)

On the first run, I noticed my feature set had grown from 53k to 58k for ~1700 texts. When I inspected the feature names, I saw that I had both capitalized and lowercase variants of the same term:

print(vec2.get_feature_names())  # get_feature_names_out() in scikit-learn 1.0 and later
---
... 'Waves' ... 'waves'

When I included lowercase=True in the CountVectorizer, I got no change in my results. Is this because the preprocessor takes precedence? (This is not how I understood the documentation.)

A simple change to the little regex function above sets everything right:

def clean_parens(text):
    new_text = text
    for rgx_match in parentheticals:
        new_text = re.sub(rgx_match, ' ', new_text.lower(), flags=re.IGNORECASE)
    return new_text

I'm happy with this, but if someone could explain what I misunderstood about the CountVectorizer, that would be great. I feel like it's a cabinet saw, and I'm used to using a handheld circular saw: its power is somewhere between might and magic in the hands of someone like me.


Solution

  • Great catch!

    I wouldn't call this an actual bug, but the documentation is lacking. Arguably an error or warning should be raised when preprocessor is a callable and lowercase=True.

    FYI, lowercasing happens in the default preprocessor function here. Hence, when you override the preprocessor with a callable, the lowercasing does not happen.

    I have raised this issue here.
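
To make the point above concrete, here is a minimal sketch of one way to keep a custom preprocessor and still get lowercasing: wrap the callable so it lowercases first. The names lowercase_then and fitted_vocabulary are hypothetical helpers made up for this illustration, not part of scikit-learn, and the pattern list is the subset from the question.

```python
import re

# Patterns from the question, written as raw strings.
PARENTHETICALS = [r"\(laughter\)", r"\(applause\)", r"\(music\)", r"\(video\)"]


def clean_parens(text):
    """Replace each listed parenthetical with a space, ignoring case."""
    for pattern in PARENTHETICALS:
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    return text


def lowercase_then(preprocess):
    """Hypothetical helper: lowercase the text, then apply `preprocess`."""
    def wrapped(text):
        return preprocess(text.lower())
    return wrapped


def fitted_vocabulary(texts, preprocessor):
    """Hypothetical helper: fit a CountVectorizer and return its sorted vocabulary."""
    # Imported here so the regex helpers above stay usable without scikit-learn.
    from sklearn.feature_extraction.text import CountVectorizer

    # lowercase=True is passed on purpose: it has no effect once a
    # callable preprocessor replaces the default preprocessing step.
    vec = CountVectorizer(preprocessor=preprocessor, lowercase=True)
    vec.fit(texts)
    return sorted(vec.vocabulary_)
```

With a made-up document such as "Waves (Laughter) and more waves", fitted_vocabulary(docs, clean_parens) should keep both 'Waves' and 'waves', while fitted_vocabulary(docs, lowercase_then(clean_parens)) should keep only 'waves'. Wrapping the callable keeps the lowercasing step separate from the cleanup logic, so clean_parens can still be used unchanged elsewhere.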