Tags: python, scikit-learn, vectorization

CountVectorizer but for groups of text


Using the following code, CountVectorizer breaks "Air-dried meat" into 3 different tokens. What I want is to keep "Air-dried meat" as a single token. How do I do it?

The code I run:

from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
# Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)

Current output:

Our vocabulary:  {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}

Desired output:

Our vocabulary:  {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}

Solution

  • You can use options in CountVectorizer to change its behaviour, e.g. token_pattern or tokenizer.


    If you use token_pattern='.+'

    CountVectorizer(binary=True, token_pattern='.+')
    

    then it will treat every element of the list as a single token.

    from sklearn.feature_extraction.text import CountVectorizer
    
    food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
    
    count_vect = CountVectorizer(binary=True, token_pattern='.+')
    bow_rep = count_vect.fit(food_names)
    
    print("Our vocabulary:", count_vect.vocabulary_)
    

    Result:

    Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
    

    If you use tokenizer=shlex.split

    CountVectorizer(binary=True, tokenizer=shlex.split)
    

    then you can use double quotes ("...") to group words within a string

    from sklearn.feature_extraction.text import CountVectorizer
    import shlex
    
    food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']
    
    count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
    bow_rep = count_vect.fit(food_names)
    
    print("Our vocabulary:", count_vect.vocabulary_)
    

    Result:

    Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}
    

    BTW: there is a similar question on the Data Science portal:

    how to avoid tokenizing w/ sklearn feature extraction


    EDIT:

    You can also convert food_names to lowercase and use it as the vocabulary

    vocabulary = [x.lower() for x in food_names]
    
    count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
    

    and it will also treat each name as a single element in the vocabulary

    from sklearn.feature_extraction.text import CountVectorizer
    
    food_names = ["Air-dried meat", "Almonds", "Amaranth"]
    vocabulary = [x.lower() for x in food_names]
    
    count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
    
    bow_rep = count_vect.fit(food_names)
    print("Our vocabulary:", count_vect.vocabulary_)
    

    The problem comes when you want to use these methods with transform(): only tokenizer=shlex.split actually splits the transformed text, and even then it needs double quotes in the text to catch Air-dried meat as one token.

    from sklearn.feature_extraction.text import CountVectorizer
    import shlex
    
    food_names = ['"Air-dried meat" Almonds Amaranth']
    
    count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
    bow_rep = count_vect.fit(food_names)
    print("Our vocabulary:", count_vect.vocabulary_)
    
    text = 'Almonds of Germany'
    temp = count_vect.transform([text])
    print(text, temp.toarray())
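    # expected: Almonds of Germany [[0 1 0]] - 'almonds' matches the vocabulary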
    
    text = '"Air-dried meat"'
    temp = count_vect.transform([text])
    print(text, temp.toarray())
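    # expected: "Air-dried meat" [[1 0 0]] - the quotes keep it as one token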
    
    text = 'Air-dried meat'
    temp = count_vect.transform([text])
    print(text, temp.toarray())
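    # expected: Air-dried meat [[0 0 0]] - without quotes it is split into 'air-dried' and 'meat'
    

    To illustrate why the token_pattern='.+' variant does not help with transform(), here is a minimal sketch (reusing the food_names list from above): the pattern matches the whole lowercased string as one token, so only inputs that exactly equal a vocabulary entry produce a hit.

    from sklearn.feature_extraction.text import CountVectorizer
    
    food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
    
    count_vect = CountVectorizer(binary=True, token_pattern='.+')
    bow_rep = count_vect.fit(food_names)
    
    # '.+' captures each whole (lowercased) document as a single token,
    # so transform() only finds exact matches against the vocabulary
    print(count_vect.transform(['Almonds']).toarray())             # [[0 1 0]]
    print(count_vect.transform(['Almonds of Germany']).toarray())  # [[0 0 0]]
    print(count_vect.transform(['Air-dried meat']).toarray())      # [[1 0 0]]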