python python-2.7 collections defaultdict

Defaultdict(defaultdict) for text analysis


Text read from file and cleaned up:

['the', 'cat', 'chased', 'the', 'dog', 'fled']

The challenge is to return a dict in which each word is a key, and its value is a nested dict mapping every word that follows it to the number of times it does so:

{'the': {'cat': 1, 'dog': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}, 'dog': {'fled': 1}}

collections.Counter will count the frequency of each unique value. However, my algorithm for solving this challenge is long and unwieldy. How might defaultdict be used to make the solution simpler?

EDIT: here is my code to brute-force this problem. A flaw is that the values in the nested dict are the total number of times a word appears in the text, not the number of times it actually follows that particular word.

import string
from collections import Counter, defaultdict

# f is assumed to be an already-open file object
wordsFile = f.read()
words = [x.strip(string.punctuation).lower() for x in wordsFile.split()]    
counter = Counter(words)

# the dict of [unique word]:[index of appearance in 'words']
index = defaultdict(list) 

# Appends every position of 'term' to the 'term' key
for pos, term in enumerate(words):
    index[term].append(pos)  

# range ends at len(index) - 2 because last word in text has no follower
master = {}
for i in range(0,(len(index)-2)):

    # z will hold the [index of appearance in 'words'] values
    z = []
    z = index.values()[i] 
    try:

        # Because I am interested in follower words
        z = [words[a+1] for a in z]
        print z; print

        # To avoid value errors if a+1 exceeds range of list
    except Exception:
        pass

    # For each word, build r into the dict that contains each follower word and its frequency.

    r = {}
    for key in z:
        r.update({key: counter[key]})

    master.update({index.keys()[i]:r})


return  master
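
To make the flaw concrete, here is a minimal sketch that is not part of the code above. The word list is a hypothetical sample, and `total_counts` and `pair_counts` are illustrative names. It contrasts how often a word appears overall with how often it directly follows another word:

```python
from collections import Counter

# Hypothetical sample: 'the' appears three times but follows 'saw' only once
words = ['the', 'cat', 'saw', 'the', 'dog', 'and', 'the', 'bird']

total_counts = Counter(words)                 # occurrences of each word anywhere in the text
pair_counts = Counter(zip(words, words[1:]))  # occurrences of each (word, follower) pair

print(total_counts['the'])          # the kind of value the code above stores
print(pair_counts[('saw', 'the')])  # the value the challenge asks for
```

The two numbers differ whenever a follower word also appears elsewhere in the text, which is why looking up `counter[key]` gives the wrong frequency.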

Solution

  • Using defaultdict:

    import collections
    
    words = ['the', 'cat','chased', 'the', 'dog', 'fled']
    result = collections.defaultdict(dict)
    
    for i in range(len(words) - 1):   # loop till second to last word
        occurs = result[words[i]]    # get the dict containing the words that follow and their freqs
        new_freq = occurs.get(words[i+1], 0) + 1  # update the freqs
        occurs[words[i+1]] = new_freq
    
    print list(result.items())
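
A possible variant, closer to the "defaultdict(defaultdict)" idea in the title, nests a `defaultdict(int)` inside the outer `defaultdict` so that missing counts start at 0 and no `.get()` call is needed. This is only a sketch under the same assumptions as the answer above (a pre-cleaned `words` list); the names `follower` and `plain` are illustrative:

```python
from collections import defaultdict

words = ['the', 'cat', 'chased', 'the', 'dog', 'fled']

# Outer dict maps a word to an inner dict; the inner dict maps a follower to its count.
result = defaultdict(lambda: defaultdict(int))

# zip pairs each word with the word right after it; the last word has no follower.
for word, follower in zip(words, words[1:]):
    result[word][follower] += 1

# Convert to plain dicts for display or comparison.
plain = {word: dict(followers) for word, followers in result.items()}
print(plain)
```

Iterating over `zip(words, words[1:])` avoids index arithmetic, so there is no need to guard against running past the end of the list.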