Using the following code, CountVectorizer breaks "Air-dried meat" into 3 different vectors, but what I want is to keep "Air-dried meat" as 1 vector. How do I do it?
The code I run:
from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True)
bow_rep = count_vect.fit(food_names)
#Look at the vocabulary mapping
print("Our vocabulary: ", count_vect.vocabulary_)
Current output:
Our vocabulary: {'air': 0, 'dried': 3, 'meat': 4, 'almonds': 1, 'amaranth': 2}
Desired output:
Our vocabulary: {'air-dried meat': 3, 'almonds': 1, 'amaranth': 2}
You can use options in CountVectorizer to change this behaviour, e.g. token_pattern or tokenizer.
If you use token_pattern='.+'
CountVectorizer(binary=True, token_pattern='.+')
then it treats every element of the list as a single token.
from sklearn.feature_extraction.text import CountVectorizer
food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, token_pattern='.+')
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
Result:
Our vocabulary: {'air-dried meat': 0, 'almonds': 1, 'amaranth': 2}
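A quick check (a sketch, assuming the same food_names list): with token_pattern='.+' the whole document is also treated as one token at transform() time, so only an exact (lowercased) match against a vocabulary entry counts.

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ['Air-dried meat', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, token_pattern='.+')
count_vect.fit(food_names)

# the whole transformed document must equal a vocabulary entry
print(count_vect.transform(['Air-dried meat']).toarray())        # exact match
print(count_vect.transform(['Air-dried meat and rice']).toarray())  # no match
```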
If you use tokenizer=shlex.split
CountVectorizer(binary=True, tokenizer=shlex.split)
then you can use double quotes " " to group words in a string.
from sklearn.feature_extraction.text import CountVectorizer
import shlex
food_names = ['"Air-dried meat" other words', 'Almonds', 'Amaranth']
count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
Result:
Our vocabulary: {'air-dried meat': 0, 'other': 3, 'words': 4, 'almonds': 1, 'amaranth': 2}
BTW: there is a similar question on the Data Science portal:
how to avoid tokenizing w/ sklearn feature extraction
EDIT:
You can also convert food_names to lowercase and use it as the vocabulary
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
This also treats each entry as a single element of the vocabulary.
from sklearn.feature_extraction.text import CountVectorizer
food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
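A minimal check (assuming the same food_names list): the multi-word entry does sit in the vocabulary, but the default token pattern still splits the input at transform() time, so that entry can never be matched there.

```python
from sklearn.feature_extraction.text import CountVectorizer

food_names = ["Air-dried meat", "Almonds", "Amaranth"]
vocabulary = [x.lower() for x in food_names]
count_vect = CountVectorizer(binary=True, vocabulary=vocabulary)
count_vect.fit(food_names)

# single-word entries match as usual
print(count_vect.transform(['Almonds']).toarray())
# 'Air-dried meat' is split into 'air', 'dried', 'meat' -> no match
print(count_vect.transform(['Air-dried meat']).toarray())
```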
The problem comes when you want to use these methods with transform(), because only tokenizer=shlex.split splits the transformed text in a way that can still match a multi-word entry. And even then the text needs quotes " " around Air-dried meat to catch it:
from sklearn.feature_extraction.text import CountVectorizer
import shlex
food_names = ['"Air-dried meat" Almonds Amaranth']
count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
bow_rep = count_vect.fit(food_names)
print("Our vocabulary:", count_vect.vocabulary_)
text = 'Almonds of Germany'
temp = count_vect.transform([text])
print(text, temp.toarray())
text = '"Air-dried meat"'
temp = count_vect.transform([text])
print(text, temp.toarray())
text = 'Air-dried meat'
temp = count_vect.transform([text])
print(text, temp.toarray())
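To make the behaviour above concrete, here is the same example rewritten as a self-contained check, asserting the rows the three transform() calls should produce (vocabulary order: 'air-dried meat', 'almonds', 'amaranth'):

```python
from sklearn.feature_extraction.text import CountVectorizer
import shlex

count_vect = CountVectorizer(binary=True, tokenizer=shlex.split)
count_vect.fit(['"Air-dried meat" Almonds Amaranth'])

# quoted phrase is kept together as one token, so it matches
assert count_vect.transform(['"Air-dried meat"']).toarray().tolist() == [[1, 0, 0]]
# unquoted phrase is split into 'air-dried' and 'meat' -> nothing matches
assert count_vect.transform(['Air-dried meat']).toarray().tolist() == [[0, 0, 0]]
# single-word entries still match normally
assert count_vect.transform(['Almonds of Germany']).toarray().tolist() == [[0, 1, 0]]
```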