python-2.7, nltk, frequency-distribution

FreqDist with nltk: ValueError: too many values to unpack


I have been trying to find the frequency distribution of nouns in a given text. If I do this:

text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
token_text= nltk.word_tokenize(text)
tagged_sent = nltk.pos_tag(token_text)
nouns= []
for word,pos in tagged_sent:
    if pos in ['NN',"NNP","NNS"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns
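
Inspecting the keys of the distribution shows what happens (exact output may vary with your NLTK version):

>>> freq_nouns.keys()
['ball', 'ball.']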

It considers "ball" and "ball." as separate words. So I went ahead and split the text into sentences before tokenizing the words:

text = "This ball is blue, small and extraordinary. Like no other ball."
text=text.lower()
sentences = nltk.sent_tokenize(text)                        
words = [nltk.word_tokenize(sent)for sent in sentences]    
tagged_sent = [nltk.pos_tag(sent)for sent in words]
nouns= []
for word,pos in tagged_sent:
    if pos in ['NN',"NNP","NNS"]:
        nouns.append(word)
freq_nouns=nltk.FreqDist(nouns)
print freq_nouns

It gives the following error:

Traceback (most recent call last):
  File "C:\beautifulsoup4-4.3.2\Trial.py", line 19, in <module>
    for word,pos in tagged_sent:
ValueError: too many values to unpack

What am I doing wrong? Please help.


Solution

  • You were so close!

In this case, your list comprehension tagged_sent = [nltk.pos_tag(sent) for sent in words] changed tagged_sent from a list of tuples to a list of lists of tuples.

Here are some things you can do to discover what type of objects you have:

    >>> type(tagged_sent), len(tagged_sent)
    (<type 'list'>, 2)
    

This shows you that you have a list; in this case, one containing 2 sentences. You can further inspect one of those sentences like this:

    >>> type(tagged_sent[0]), len(tagged_sent[0])
    (<type 'list'>, 9)
    

You can see that the first sentence is another list, containing 9 items. What does one of those items look like? Let's look at the first item of the first list:

    >>> tagged_sent[0][0]
    ('this', 'DT')
    

If you're curious to see the entire object, which I frequently am, you can ask the pprint (pretty-print) module to make it nicer to look at, like this:

    >>> from pprint import pprint
    >>> pprint(tagged_sent)
    [[('this', 'DT'),
      ('ball', 'NN'),
      ('is', 'VBZ'),
      ('blue', 'JJ'),
      (',', ','),
      ('small', 'JJ'),
      ('and', 'CC'),
      ('extraordinary', 'JJ'),
      ('.', '.')],
     [('like', 'IN'), ('no', 'DT'), ('other', 'JJ'), ('ball', 'NN'), ('.', '.')]]
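
    This is also exactly why the unpacking failed: each item of tagged_sent is now a whole sentence (a list of 9 or 5 tuples), which Python cannot unpack into just word, pos. You can reproduce the error directly:

    >>> word, pos = tagged_sent[0]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: too many values to unpack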
    

So, the long answer is that your code needs to iterate over the new second layer of lists, like this:

    nouns = []
    for sentence in tagged_sent:
        for word, pos in sentence:
            if pos in ['NN', 'NNP', 'NNS']:
                nouns.append(word)
    

Of course, this just returns a non-unique list of items, which looks like this:

    >>> nouns
    ['ball', 'ball']
    

You can unique-ify this list in many different ways, but a quick way is the set() data structure, like so:

    >>> unique_nouns = list(set(nouns))
    >>> print unique_nouns
    ['ball']
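
    One caveat: set() throws away the original order of the words. If order matters to you, a common recipe (a sketch, not from the original code) is to track what you have already seen:

    seen = set()
    unique_nouns = []
    for noun in nouns:
        if noun not in seen:            # keep only the first occurrence
            seen.add(noun)
            unique_nouns.append(noun)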
    

    For an examination of other ways to unique-ify a list of items, see the slightly older but extremely useful http://www.peterbe.com/plog/uniqifiers-benchmark
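
    Putting it all together, here is the corrected version of the original script, end to end (Python 2 style to match the question; the tag set is unchanged):

    import nltk

    text = "This ball is blue, small and extraordinary. Like no other ball."
    text = text.lower()
    sentences = nltk.sent_tokenize(text)
    words = [nltk.word_tokenize(sent) for sent in sentences]
    tagged_sent = [nltk.pos_tag(sent) for sent in words]

    nouns = []
    for sentence in tagged_sent:        # outer loop: one sentence at a time
        for word, pos in sentence:      # inner loop: (word, tag) tuples
            if pos in ['NN', 'NNP', 'NNS']:
                nouns.append(word)

    freq_nouns = nltk.FreqDist(nouns)
    print freq_nouns                    # 'ball' is now counted twice, with no stray 'ball.' entry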