Create word dictionary for sentences in a list


I have a list of sentences:

a = [['i am a testing'],['we are working on project']]

I am trying to create a word dictionary for all the sentences in the list. I tried:

vectorizer = CountVectorizer()
vectorizer.fit_transform(a)
coffee_dict2 = vectorizer.vocabulary_

But I am getting an error: AttributeError: 'list' object has no attribute 'lower'

The result I am expecting is a dictionary:

{'i': 1, 'am': 1, 'testing': 2}


Solution

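  • You need to flatten the nested lists, because CountVectorizer expects an iterable of strings and here each element is itself a list: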

    from itertools import chain
    from sklearn.feature_extraction.text import CountVectorizer
    
    coffee_reviews_test = [['i am a testing'],['we are working on project']]
    
    # chain.from_iterable yields each inner string, so the vectorizer sees a flat sequence of sentences
    vectorizer = CountVectorizer()
    vectorizer.fit_transform(chain.from_iterable(coffee_reviews_test))
    

    Another solution, using a nested list comprehension to flatten:

    vectorizer.fit_transform([x for y in coffee_reviews_test for x in y])
    

    coffee_dict2 = vectorizer.vocabulary_
    print(coffee_dict2)
    {'am': 0, 'testing': 4, 'we': 5, 'are': 1, 'working': 6, 'on': 2, 'project': 3}
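
The values in vocabulary_ are the column index of each word in the count matrix, not how often the word occurs, and 'i' and 'a' are missing because the default token_pattern only matches tokens of two or more characters. If the goal is a dictionary of word counts, as the expected output in the question suggests, a minimal sketch could look like the following. It assumes single-character words should be kept (hence the custom token_pattern); the names sentences, docs, totals, and word_counts are illustrative only.

```python
import numpy as np
from itertools import chain
from sklearn.feature_extraction.text import CountVectorizer

sentences = [['i am a testing'], ['we are working on project']]

# Flatten the nested lists into a flat list of strings.
docs = list(chain.from_iterable(sentences))

# Assumption: single-character words such as "i" and "a" should be kept.
# The default token_pattern only matches tokens of two or more characters.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(docs)

# Sum each column to get the total count of each word across all sentences.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps word -> column index, so use it to look up each total.
word_counts = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}
print(word_counts)
```

Every word in the sample data occurs once, so each count is 1. Dropping the token_pattern argument restores the default behavior, in which 'i' and 'a' are ignored.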