CountVectorizer not able to detect words

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

final_vocab = {'Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra'}
 
vect = CountVectorizer(vocabulary=final_vocab)
token_df = pd.DataFrame(vect.fit_transform(['Big Bazaar','Brand Factory']).todense(), columns=vect.get_feature_names_out())

Why is all the output zero? The values for Big Bazaar and Brand Factory should be 1.

Solution

Your CountVectorizer is missing 2 things:

ngram_range=(2, 2). As stated in the docs: All values of n such that min_n <= n <= max_n will be used. This makes CountVectorizer extract 2-grams from the input (Big Bazaar instead of ['Big', 'Bazaar']).
lowercase=False. The docs describe the lowercase parameter as: Convert all characters to lowercase before tokenizing. It defaults to True, so Big Bazaar and Brand Factory are lowercased and can no longer be found in the vocabulary. Setting it to False prevents that.
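
To see the effect of both settings, you can inspect the tokens the vectorizer produces before they are matched against the vocabulary. This is a minimal sketch using build_analyzer(); the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: text is lowercased and split into single words
default_analyzer = CountVectorizer().build_analyzer()

# Suggested settings: keep the casing and emit two-word tokens
bigram_analyzer = CountVectorizer(ngram_range=(2, 2), lowercase=False).build_analyzer()

default_tokens = default_analyzer("Big Bazaar")
bigram_tokens = bigram_analyzer("Big Bazaar")

print(default_tokens)  # ['big', 'bazaar']
print(bigram_tokens)   # ['Big Bazaar']
```

Neither 'big' nor 'bazaar' is an entry in the vocabulary, which is why every count comes out as zero with the default settings.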

Also, because you've provided a vocabulary to CountVectorizer, there is nothing left to learn from the data, so use transform instead of fit_transform.

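import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary; the output columns follow the order of this list
final_vocab = ['Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra']

# ngram_range=(2, 2) yields two-word tokens; lowercase=False keeps the original casing
vect = CountVectorizer(vocabulary=final_vocab, ngram_range=(2, 2), lowercase=False)
# The vocabulary is fixed, so transform() is enough and no fit is needed.
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
token_df = pd.DataFrame(vect.transform(['Big Bazaar','Brand Factory']).toarray(), columns=vect.get_feature_names_out())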
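
One caveat: ngram_range=(2, 2) produces only two-word tokens, so single-word entries such as Amazon would never be counted. If your real data contains both kinds, ngram_range=(1, 2) is probably what you want. A minimal sketch, assuming a shortened vocabulary and a made-up third document:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Shortened vocabulary for illustration (mix of one- and two-word entries)
vocab = ['Amazon', 'Big Bazaar', 'Brand Factory', 'Myntra']
docs = ['Big Bazaar', 'Brand Factory', 'Amazon']

# (1, 2) emits both single words and two-word tokens
vect = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2), lowercase=False)
counts = vect.transform(docs).toarray()

# With a list vocabulary, the columns follow the list order
token_df = pd.DataFrame(counts, columns=vocab)
print(token_df)
```

Each row should then have a single 1 in the column of the matching vocabulary entry.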