import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

final_vocab = {'Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra'}
vect = CountVectorizer(vocabulary=final_vocab)
token_df = pd.DataFrame(vect.fit_transform(['Big Bazaar','Brand Factory']).todense(), columns=vect.get_feature_names())
Why is all the output zero? The values for Big Bazaar and Brand Factory should come out as 1.
Your CountVectorizer is missing two things:

ngram_range=(2, 2). As stated in the docs: "All values of n such that min_n <= n <= max_n will be used." This makes CountVectorizer build 2-grams from the input, so it sees Big Bazaar as one feature instead of ['Big', 'Bazaar'].

lowercase=False. The docs describe the lowercase option as "Convert all characters to lowercase before tokenizing", and it defaults to True. That turns Big Bazaar and Brand Factory into lowercase, so they can't be found in the vocabulary. Setting it to False prevents that.

Also, because you've provided a vocabulary to CountVectorizer, use transform instead of fit_transform.
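To see both effects directly, you can inspect what the analyzer produces for one input string. This is only a small sketch using build_analyzer(), with "Big Bazaar" as the sample input; the variable names are mine, not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: the text is lowercased and split into single words.
default_analyzer = CountVectorizer().build_analyzer()

# Settings from this answer: keep the original case and emit only 2-grams.
fixed_analyzer = CountVectorizer(ngram_range=(2, 2), lowercase=False).build_analyzer()

default_tokens = default_analyzer("Big Bazaar")
fixed_tokens = fixed_analyzer("Big Bazaar")

print(default_tokens)  # ['big', 'bazaar'] -> neither token is in the vocabulary
print(fixed_tokens)    # ['Big Bazaar']    -> matches the vocabulary entry
```

Only tokens that the analyzer emits and that also appear in the vocabulary get counted, which is why the original output was all zeros.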
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
final_vocab = ['Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra']
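# Keep the original case and build 2-grams so 'Big Bazaar' matches the vocabulary entry.
vect = CountVectorizer(vocabulary=final_vocab, ngram_range=(2, 2), lowercase=False)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0.
token_df = pd.DataFrame(vect.transform(['Big Bazaar','Brand Factory']).todense(), columns=vect.get_feature_names_out())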
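One caveat, in case your real inputs also contain the single-word names: with ngram_range=(2, 2) only 2-grams are generated, so an input like Amazon would still count as zero. A possible variant, assuming you want both one-word and two-word entries to match, is ngram_range=(1, 2). The shortened vocabulary and the extra 'Amazon' input below are just for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Shortened vocabulary for illustration; a list keeps the column order fixed.
vocab = ['Amazon', 'Big Bazaar', 'Brand Factory']
docs = ['Big Bazaar', 'Brand Factory', 'Amazon']

# (1, 2) emits both single words and 2-grams; tokens outside the vocabulary are ignored.
vect = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2), lowercase=False)
counts = vect.transform(docs).toarray()

# For comparison: with (2, 2) a one-word input produces no n-grams at all.
strict = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2), lowercase=False)
strict_counts = strict.transform(['Amazon']).toarray()

token_df = pd.DataFrame(counts, columns=vocab, index=docs)
print(token_df)
print(strict_counts)
```

Each row of token_df should have a single 1 in the column for its own name, while strict_counts stays all zeros for Amazon.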