python, scikit-learn, countvectorizer

CountVectorizer fit_transform() not working for custom token_pattern


I need to use CountVectorizer on text that contains the names of programming languages such as 'R' and 'C', but CountVectorizer discards "words" that contain only one character.

    from sklearn.feature_extraction.text import CountVectorizer

    cv1 = CountVectorizer(min_df=2, stop_words='english')
    tokenize = cv1.build_tokenizer()
    tokenize("Python, Time Series, Cloud, Data Modeling, R")

Output:

Out[172]: ['Python', 'Time', 'Series', 'Cloud', 'Data', 'Modeling']

I then tweak the 'token_pattern' so that it considers 'R' also as a token.

    cv1 = CountVectorizer(min_df=1, stop_words='english', token_pattern=r'(?u)\b\w\w+\b|R|C', tokenizer=None)
    tokenize = cv1.build_tokenizer()
    tokenize("Python, Time Series, Cloud, R ,Data Modeling") 

Output:

Out[187]: ['Python', 'Time', 'Series', 'Cloud', 'R', 'Data', 'Modeling']

But,

    cvmatrix1 = cv1.fit_transform(["Python, Time Series, Cloud, R ,Data Modeling"])
    cv1.vocabulary_ 

gives the output:

Out[189]: {'cloud': 0, 'data': 1, 'modeling': 2, 'python': 3, 'series': 4, 'time': 5}

Why is this happening?


Solution

  • R is dropped because the regex matches only the capital letters R and C, while the text that actually reaches the tokenizer is lower case. CountVectorizer defaults to lowercase=True, so the preprocessor calls .lower() on the original string before it is tokenized:

    tokenize = cv1.build_tokenizer()
    preprocess = cv1.build_preprocessor()
    tokenize(preprocess("Python, Time Series, Cloud, R ,Data Modeling"))
    

    Yields:

    ['python', 'time', 'series', 'cloud', 'data', 'modeling']
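
Given that explanation, two possible workarounds are sketched below. Neither is the only way to do it. The first keeps the default lowercasing and matches the lowercased letters instead; the `\b[rc]\b` alternative is an illustrative pattern of my own, not part of the question. The second keeps the original pattern and passes lowercase=False so that the capital letters survive preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Python, Time Series, Cloud, R ,Data Modeling"]

# Option 1: keep the default lowercase=True and match the lowercased letters.
# \b[rc]\b is an assumed pattern for a standalone "r" or "c".
cv_lower = CountVectorizer(
    stop_words='english',
    token_pattern=r'(?u)\b\w\w+\b|\b[rc]\b',
)
cv_lower.fit_transform(docs)
vocab_lower = cv_lower.vocabulary_

# Option 2: keep the original pattern and disable lowercasing,
# so the tokenizer sees the capital "R" and "C".
cv_cased = CountVectorizer(
    stop_words='english',
    token_pattern=r'(?u)\b\w\w+\b|R|C',
    lowercase=False,
)
cv_cased.fit_transform(docs)
vocab_cased = cv_cased.vocabulary_

print(sorted(vocab_lower))
print(sorted(vocab_cased))
```

With the second option the vocabulary becomes case-sensitive, so 'Python' and 'python' would be counted as different terms, and capitalized stop words such as 'The' would no longer match the lowercase stop-word list.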