scikit-learn, tokenize, countvectorizer

Custom tokenizer not working in CountVectorizer (sklearn)


I am trying to build a CountVectorizer with a custom tokenizer function, and I am running into a strange problem. In the code below, temp_tok is a list of 5 values that is later used as the vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]

def tokenize(text):
    return [temp_tok[0],temp_tok[1], "sinus", "Normal sinus"]

def tokenize2(text):
    return [i for i in temp_tok if i in text]

text = "Normal sinus rhythm"

Both functions return the same output for text:

tokenize(text)
output = ['or', 'Normal sinus rhythm', 'sinus', 'Normal sinus']

But when I build a vectorizer with each of these tokenizers, tokenize2 gives unexpected output. The vocabulary is temp_tok in both cases. I experimented with ngram_range, but it did not help.

vectorizer = CountVectorizer(vocabulary=temp_tok,tokenizer = tokenize)
vectorizer2 = CountVectorizer(vocabulary=temp_tok,tokenizer = tokenize2)

While vectorizer.transform([text]) gives the expected output, vectorizer2.transform([text]) gives 1 only for "or" and "sinus":

vectorizer.transform(["Normal sinus rhythm"]).toarray()
array([[1, 1, 1, 0, 1]])

vectorizer2.transform(["Normal sinus rhythm"]).toarray()
array([[1, 0, 1, 0, 0]])

I also tried passing a dictionary instead of the list temp_tok as the vocabulary to CountVectorizer, but it doesn't help. Is this a problem in sklearn, or am I doing something wrong?


Solution

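  • CountVectorizer lowercases each document by default (lowercase=True) before passing it to the tokenizer. tokenize2 therefore receives "normal sinus rhythm", in which the capitalized entries "Normal sinus rhythm" and "Normal sinus" are not found, so only "or" and "sinus" match. tokenize ignores its argument and always returns the same tokens, which is why it appears to work. You can confirm this by adding a print call to tokenize2: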

    def tokenize2(text):
        print(text)  # shows the already lowercased text the vectorizer passes in
        return [i for i in temp_tok if i in text]
    

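    A good solution is to lowercase the elements of temp_tok, so that both the vocabulary and the tokens match the lowercased text. Alternatively, pass lowercase=False to CountVectorizer, or use any other approach that handles upper and lower case consistently.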
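
As a minimal sketch of both fixes, assuming the temp_tok list and tokenize2 function from the question: the names vec_keep_case, temp_tok_lower, tokenize2_lower, and vec_lower are introduced here only for illustration. Depending on the scikit-learn version, a warning that token_pattern is unused may appear when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

temp_tok = ["or", "Normal sinus rhythm", "sinus", "anuj", "Normal sinus"]
text = "Normal sinus rhythm"


def tokenize2(text):
    # Keep every vocabulary entry that occurs as a substring of the text.
    return [tok for tok in temp_tok if tok in text]


# Option 1: turn off the default lowercasing so the tokenizer
# receives the text with its original capitalization.
vec_keep_case = CountVectorizer(
    vocabulary=temp_tok, tokenizer=tokenize2, lowercase=False
)
result_keep_case = vec_keep_case.transform([text]).toarray()

# Option 2: keep the default lowercasing and lowercase the vocabulary
# (and the tokens the tokenizer returns) to match.
temp_tok_lower = [tok.lower() for tok in temp_tok]


def tokenize2_lower(text):
    # The text arrives already lowercased, so compare against lowercase entries.
    return [tok for tok in temp_tok_lower if tok in text]


vec_lower = CountVectorizer(vocabulary=temp_tok_lower, tokenizer=tokenize2_lower)
result_lower = vec_lower.transform([text]).toarray()

print(result_keep_case)
print(result_lower)
```

With either option, the entries "Normal sinus rhythm" and "Normal sinus" should be counted alongside "or" and "sinus". Note that the substring check still matches "or" inside "Normal", exactly as in the original tokenize2.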