Text read from file and cleaned up:
['the', 'cat', 'chased', 'the', 'dog', 'fled']
The challenge is to return a dict with each word as a key and, as its value, a nested dict mapping each word that can follow it to the number of times it does so:
{'the': {'cat': 1, 'dog': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}, 'dog': {'fled': 1}}
collections.Counter will count the frequency of each unique value. However, my algorithm to solve this challenge is long and unwieldy. How might defaultdict be used to make the solution simpler?
EDIT: here is my code to brute-force this problem. A flaw is that the values in the nested dict are the total number of times a word appears in the text, not the number of times it actually follows that particular word.
from collections import Counter, defaultdict
import string
wordsFile = f.read()
words = [x.strip(string.punctuation).lower() for x in wordsFile.split()]
counter = Counter(words)
# the dict of [unique word]:[index of appearance in 'words']
index = defaultdict(list)
# Appends every position of 'term' to the 'term' key
for pos, term in enumerate(words):
    index[term].append(pos)
# range ends at len(index) - 2 because last word in text has no follower
master = {}
for i in range(0,(len(index)-2)):
    # z will hold the [index of appearance in 'words'] values
    z = []
    z = index.values()[i]
    try:
        # Because I am interested in follower words
        z = [words[a+1] for a in z]
        print z; print
    # To avoid value errors if a+1 exceeds range of list
    except Exception:
        pass
    # For each word, build r into the dict that contains each follower word and its frequency.
    r = {}
    for key in z:
        r.update({key: counter[key]})
    master.update({index.keys()[i]:r})
return master
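The flaw described above can be seen on the sample list. This is only a minimal sketch: the names `totals` and `pairs` are illustrative and not part of the original code. It contrasts the whole-text count of a word with the count of a specific (word, follower) pair.

```python
from collections import Counter

words = ['the', 'cat', 'chased', 'the', 'dog', 'fled']

# Total occurrences of each word in the whole text
# (this is what counter[key] contributes in the code above).
totals = Counter(words)

# Occurrences of each adjacent (word, follower) pair
# (this is what the challenge actually asks for).
pairs = Counter(zip(words, words[1:]))

print(totals['the'])             # total count of 'the' in the text
print(pairs[('chased', 'the')])  # times 'the' directly follows 'chased'
```

The two numbers differ for 'the', which is why the nested values come out wrong.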
Using defaultdict:
import collections
words = ['the', 'cat','chased', 'the', 'dog', 'fled']
result = collections.defaultdict(dict)
for i in range(len(words) - 1):  # loop till second to last word
    occurs = result[words[i]]  # get the dict containing the words that follow and their freqs
    new_freq = occurs.get(words[i+1], 0) + 1  # update the freqs
    occurs[words[i+1]] = new_freq
print(list(result.items()))
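A possibly more compact variant combines defaultdict with Counter, so the inner counts need no explicit get(..., 0). This is a sketch rather than a definitive implementation: the function name `follower_counts` is hypothetical, and it assumes `words` is an already cleaned list like the one above.

```python
from collections import Counter, defaultdict

def follower_counts(words):
    """Map each word to a Counter of the words that directly follow it."""
    result = defaultdict(Counter)
    # zip pairs each word with its successor; the last word has no follower
    for word, follower in zip(words, words[1:]):
        result[word][follower] += 1
    return result

words = ['the', 'cat', 'chased', 'the', 'dog', 'fled']
counts = follower_counts(words)
# Convert the inner Counters to plain dicts for display
print({word: dict(followers) for word, followers in counts.items()})
```

Because zip stops at the shorter sequence, there is no index arithmetic and no risk of running past the end of the list.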