Tags: python, scikit-learn, vectorization

CountVectorizer but for groups of text


Using the following code, CountVectorizer breaks "Air-dried meat" into 3 different tokens. What I want is to keep "Air-dried meat" as a single token. How do I do it?

The code I run:

from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
# Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

Current output:

Our vocabulary:  {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}

Desired output:

Our vocabulary:  {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}

Solution

  • You can use options in CountVectorizer to change its behaviour, e.g. token_pattern or tokenizer.


    If you use token_pattern='.+'

    CountVectorizer(binary=True, token_pattern='.+')
    

    then it will treat every element of the list as a single token.

    from sklearn.feature_extraction.text import CountVectorizer
    
    food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
    
    count_vect = CountVectorizer(binary=True, token_pattern='.+')
    bow_rep = count_vect.fit(food_names)
    
    print("Our vocabulary:", count_vect.vocabulary_)
    

    Result:

    Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
    

    If you use tokenizer=shlex.split

    CountVectorizer(binary=True, tokenizer=shlex.split)
    

    then you can use double quotes ("...") to group words within a string

    from sklearn.feature_extraction.text import CountVectorizer
    import shlex
    
    food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']
    
    count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
    bow_rep = count_vect.fit(food_names)
    
    print("Our vocabulary:", count_vect.vocabulary_)
    

    Result:

    Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}
    

    BTW: there is a similar question on the Data Science portal:

    how to avoid tokenizing w/ sklearn feature extraction


    EDIT:

    You can also convert food_names to lowercase and use it as the vocabulary

    vocabulary = [x.lower() for x in food_names]
    
    count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
    

    and it will also treat each name as a single element in the vocabulary

    from sklearn.feature_extraction.text import CountVectorizer
    
    food_names = ["Air-dried meat", "Almonds", "Amaranth"]
    vocabulary = [x.lower() for x in food_names]
    
    count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
    
    bow_rep = count_vect.fit(food_names)
    print("Our vocabulary:", count_vect.vocabulary_)
    

    The problem comes when you want to use these methods with transform(): only tokenizer=shlex.split actually splits the transformed text, and even then it needs double quotes in the text to catch Air-dried meat as one token.

    from sklearn.feature_extraction.text import CountVectorizer
    import shlex
    
    food_names = ['"Air-dried meat" Almonds Amaranth']
    
    count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
    bow_rep = count_vect.fit(food_names)
    print("Our vocabulary:", count_vect.vocabulary_)
    
    text = 'Almonds of Germany'
    temp = count_vect.transform([text])
    print(text, temp.toarray())
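    # expected: Almonds of Germany [[0 1 0]] - 'almonds' matches the vocabulary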
    
    text = '"Air-dried meat"'
    temp = count_vect.transform([text])
    print(text, temp.toarray())
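    # expected: "Air-dried meat" [[1 0 0]] - the quotes keep it as one token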
    
    text = 'Air-dried meat'
    temp = count_vect.transform([text])
    print(text, temp.toarray())
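    # expected: Air-dried meat [[0 0 0]] - without quotes it is split into 'air-dried' and 'meat'
    

    To illustrate why the token_pattern='.+' variant does not help with transform(), here is a minimal sketch (reusing the food_names list from above): the pattern matches the whole lowercased string as one token, so only inputs that exactly equal a vocabulary entry produce a hit.

    from sklearn.feature_extraction.text import CountVectorizer
    
    food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
    
    count_vect = CountVectorizer(binary=True, token_pattern='.+')
    bow_rep = count_vect.fit(food_names)
    
    # '.+' captures each whole (lowercased) document as a single token,
    # so transform() only finds exact matches against the vocabulary
    print(count_vect.transform(['Almonds']).toarray())             # [[0 1 0]]
    print(count_vect.transform(['Almonds of Germany']).toarray())  # [[0 0 0]]
    print(count_vect.transform(['Air-dried meat']).toarray())      # [[1 0 0]]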