Python Frequency Distribution (FreqDist / NLTK) Issue


I'm attempting to break a list of words (a tokenized string) into every possible contiguous run of words (a "substring"). I'd then like to run a FreqDist over those substrings to find the most common one. The first part works fine. However, when I run the FreqDist, I get this error:

TypeError: unhashable type: 'list'

Here is my code:

import nltk

string = ['This','is','a','sample']
substrings = []

count1 = 0
count2 = 0

for word in string:
    while count2 <= len(string):
        if count1 != count2:
            temp = string[count1:count2]
            substrings.append(temp)
        count2 += 1
    count1 += 1
    count2 = count1

print substrings

fd = nltk.FreqDist(substrings)

print fd

The output of substrings is fine. Here it is:

[['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'], ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'], ['sample']]

However, I just can't get the FreqDist to run on it. Any insight would be greatly appreciated. In this small example each substring would only have a count of 1, but the program is meant to be run on a much larger sample of text.


Solution

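  • I'm not completely certain what you want, but the error message says that something is trying to hash a list, which is usually a sign that the list is being put in a set or used as a dictionary key. Lists are mutable and therefore unhashable, so FreqDist can't use them as samples. We can get around this by giving it tuples instead. First, here is a more compact way to build the same sub-lists: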

    >>> import nltk
    >>> import itertools
    >>> 
    >>> sentence = ['This','is','a','sample']
    >>> contiguous_subs = [sentence[i:j] for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
    >>> contiguous_subs
    [['This'], ['This', 'is'], ['This', 'is', 'a'], ['This', 'is', 'a', 'sample'],
     ['is'], ['is', 'a'], ['is', 'a', 'sample'], ['a'], ['a', 'sample'],
     ['sample']]
    

    With lists, though, we still get the same error:

    >>> fd = nltk.FreqDist(contiguous_subs)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 107, in __init__
        self.update(samples)
      File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 437, in update
        self.inc(sample, count=count)
      File "/usr/local/lib/python2.7/dist-packages/nltk/probability.py", line 122, in inc
        self[sample] = self.get(sample,0) + count
    TypeError: unhashable type: 'list'
    

    If we turn each sub-list into a tuple, though, it works:

    >>> contiguous_subs = [tuple(sentence[i:j]) for i,j in itertools.combinations(xrange(len(sentence)+1), 2)]
    >>> contiguous_subs
    [('This',), ('This', 'is'), ('This', 'is', 'a'), ('This', 'is', 'a', 'sample'), ('is',), ('is', 'a'), ('is', 'a', 'sample'), ('a',), ('a', 'sample'), ('sample',)]
    >>> fd = nltk.FreqDist(contiguous_subs)
    >>> print fd
    <FreqDist: ('This',): 1, ('This', 'is'): 1, ('This', 'is', 'a'): 1, ('This', 'is', 'a', 'sample'): 1, ('a',): 1, ('a', 'sample'): 1, ('is',): 1, ('is', 'a'): 1, ('is', 'a', 'sample'): 1, ('sample',): 1>
    

    Is that what you're looking for?
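
For Python 3, where `print` is a function and `xrange` is gone, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions: it uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist` so that it runs without NLTK installed (in NLTK 3, `FreqDist` is a subclass of `Counter`), and the helper name `contiguous_subs` and the sample sentence with a repeated phrase are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def contiguous_subs(tokens):
    """Return every contiguous run of tokens as a hashable tuple."""
    # Each pair (i, j) with i < j is one slice boundary; tuple() makes the
    # slice usable as a dictionary key.
    return [tuple(tokens[i:j])
            for i, j in combinations(range(len(tokens) + 1), 2)]


# Lists are mutable and therefore unhashable; this is the root of the error.
try:
    hash(['This', 'is'])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Hypothetical sample with a repeated phrase, so the counts are not all 1.
sentence = ['a', 'sample', 'is', 'a', 'sample']

# Counter stands in for nltk.FreqDist here (an assumption, see above).
fd = Counter(contiguous_subs(sentence))

print(list_is_hashable)
print(fd.most_common(3))
```

With NLTK 3 installed, replacing `Counter(...)` with `nltk.FreqDist(...)` should behave the same way here, and `most_common` is the call that answers the "most common substring" part of the question.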