Is defaultdict() the correct choice?


EDIT: mistake fixed

The idea is to read text from a file, clean it, and pair consecutive words (not permutations):

file = f.read()
words = [word.strip(string.punctuation).lower() for word in file.split()]
pairs = [(words[i]+" " + words[i+1]).split() for i in range(len(words)-1)]

Then, for each pair, create a list of all the possible individual words that can follow that pair throughout the text. The dict will look like

[ConsecWordPair]:[listOfFollowers]

Thus, referencing the dictionary for a given pair will return all of the words that can follow that pair. E.g.

wordsThatFollow[('she', 'was')]
>> ['alone', 'happy', 'not']

My algorithm to achieve this involves a defaultdict(list)...

wordsThatFollow = defaultdict(list) 

for i in range(len(words)-1):
    try:
        # pairs overlap, want second word of next pair
        # wordsThatFollow[tuple(pairs[i])] = pairs[i+1][1]
        EDIT: wordsThatFollow[tuple(pairs[i])].update(pairs[i+1][1][0]
    except Exception:
        pass

I'm not so worried about the value error I have to circumvent with the 'try-except' (unless I should be). The problem is that the algorithm only successfully returns one of the followers:

wordsThatFollow[('she', 'was')]
>> ['not']

Sorry if this post is bad for the community; I'm figuring things out as I go ^^


Solution

  • Your problem is that you are always overwriting the value stored for a pair, when you really want to append to the list that defaultdict(list) creates for it:

    # Instead of this
    wordsThatFollow[tuple(pairs[i])] = pairs[i+1][1]
    
    # Do this
    wordsThatFollow[tuple(pairs[i])].append(pairs[i+1][1])
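For reference, here is a minimal end-to-end sketch of the same idea. It is one possible way to write it, not necessarily how the original code has to be structured. The sample text and the snake_case names are assumptions for illustration; the sample stands in for f.read(). Since pairs[i+1][1] is just words[i+2], the sketch indexes words directly, and it stops the loop two short of the end, which should make the try/except unnecessary (the error being swallowed there is most likely an IndexError on the last iteration).

```python
import string
from collections import defaultdict

# Sample text standing in for f.read(); this is an assumption for illustration.
text = "She was alone. She was happy, she was not sad."

# Same cleaning step as in the question: strip punctuation and lowercase.
words = [w.strip(string.punctuation).lower() for w in text.split()]

words_that_follow = defaultdict(list)

# Stop two short of the end so words[i + 2] always exists.
for i in range(len(words) - 2):
    pair = (words[i], words[i + 1])
    # append() adds to the list for this pair instead of replacing it.
    words_that_follow[pair].append(words[i + 2])

print(words_that_follow[("she", "was")])  # ['alone', 'happy', 'not']
```

Note that merely looking up a missing pair in a defaultdict inserts an empty list for it, so use `pair in words_that_follow` if you only want to check whether a pair was seen.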