
CountVectorizer not able to detect multi-word vocabulary entries


final_vocab = {'Amazon',
'Big Bazaar',
'Brand Factory',
'Central',
'Cleartrip',
'Dominos',
'Flipkart',
'IRCTC',
'Lenskart',
'Lifestyle',
'MAX',
'MMT',
'More',
'Myntra'}
 
vect = CountVectorizer(vocabulary=final_vocab)
token_df = pd.DataFrame(vect.fit_transform(['Big Bazaar','Brand Factory']).todense(), columns=vect.get_feature_names())


Why is all of the output zero? The values for Big Bazaar and Brand Factory should be 1.


Solution

  • Your CountVectorizer is missing 2 things:

    1. ngram_range=(2, 2). As stated in the docs: all values of n such that min_n <= n <= max_n will be used. This makes CountVectorizer produce two-word n-grams from the input (Big Bazaar instead of ['Big', 'Bazaar']). Note that (2, 2) produces two-word n-grams only, so single-word entries such as Amazon will not be matched; ngram_range=(1, 2) covers both.
    2. lowercase=False. The default, lowercase=True, converts all characters to lowercase before tokenizing. That turns Big Bazaar and Brand Factory into big bazaar and brand factory, which can't be found in the vocabulary. Setting it to False prevents that from happening.
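    To see the effect of those two settings, it can help to look at what the analyzer actually emits for an input string. The following is a minimal sketch using build_analyzer(); the variable names are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Default settings: lowercase=True and ngram_range=(1, 1)
default_analyzer = CountVectorizer().build_analyzer()

# Settings suggested above: keep case, emit two-word n-grams
bigram_analyzer = CountVectorizer(ngram_range=(2, 2), lowercase=False).build_analyzer()

default_tokens = default_analyzer("Big Bazaar")
bigram_tokens = bigram_analyzer("Big Bazaar")

print(default_tokens)  # ['big', 'bazaar']
print(bigram_tokens)   # ['Big Bazaar']
```

    Neither 'big' nor 'bazaar' is a key in the vocabulary, which is why every count comes out as zero with the default settings.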

    Also, because you've provided a fixed vocabulary to CountVectorizer, there is nothing to learn from the data, so you can call transform instead of fit_transform.

    import pandas as pd
    from sklearn.feature_extraction.text import CountVectorizer
    
    final_vocab = ['Amazon',
    'Big Bazaar',
    'Brand Factory',
    'Central',
    'Cleartrip',
    'Dominos',
    'Flipkart',
    'IRCTC',
    'Lenskart',
    'Lifestyle',
    'MAX',
    'MMT',
    'More',
    'Myntra']
     
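    # Keep the original case and build two-word n-grams so 'Big Bazaar' matches the vocabulary
    vect = CountVectorizer(vocabulary=final_vocab, ngram_range=(2, 2), lowercase=False)
    counts = vect.transform(['Big Bazaar', 'Brand Factory']).toarray()
    # get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
    token_df = pd.DataFrame(counts, columns=vect.get_feature_names_out())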
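    If the input can also contain the single-word names, a range of (1, 2) is probably the safer choice. Here is a small sketch comparing the two ranges; the shortened vocab list and the docs list are made up for illustration and are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Shortened, illustrative vocabulary: a mix of one-word and two-word names
vocab = ['Amazon', 'Big Bazaar', 'Brand Factory', 'MAX']
docs = ['Big Bazaar', 'Brand Factory', 'Amazon']

# Two-word n-grams only: 'Amazon' is never generated as a feature
bigram_only = CountVectorizer(vocabulary=vocab, ngram_range=(2, 2), lowercase=False)

# One-word and two-word n-grams: both kinds of entries can match
uni_and_bi = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2), lowercase=False)

bigram_counts = bigram_only.transform(docs).toarray()
mixed_counts = uni_and_bi.transform(docs).toarray()

print(bigram_counts)  # the 'Amazon' row is all zeros
print(mixed_counts)   # every row has exactly one 1
```

    With (1, 2) the single tokens 'Big' and 'Bazaar' are generated as well, but they are ignored because they are not in the fixed vocabulary.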