
Python3: Counting occurrence of characters in a nested dict


I'm working on a small piece of code and have run into a roadblock. Is it possible to find the most common character that follows a specific group of characters?

For example, say I have the following sentence:

"fishies are super neat, a fish is a good pet. also, fishing is for dads."

How could I determine, for example, the most common character that occurs after the fragment "fish"?

In this specific example, doing it by hand, I get something like this:

{"i": 2, " ": 1}

Currently, I have this chunk of code written to grab the "fish" portion of the word:

b = Class(n, 'file.txt')
ngrams = [b.file[i:i+n] for i in range(len(b.file) - n + 1)]

With n = 4, this breaks the text into chunks of 4 characters, like so: ['fish', 'ishi', 'shie', 'hies', 'ies ', 'es a'.....]

My goal is to combine these two thoughts so that I can print something that looks like the following:

{'fish' : {'i':2, ' ':1} ..... }

I also currently have a defaultdict defined in __init__ like so: self.counts=defaultdict(lambda: defaultdict(int))

This is the closest I can get to achieving my desired solution, although I am unsure how to grab the individual characters that follow and how to count those characters:

b.counts = {i : { j : 5 for j in ngrams } for i in ngrams }

5 is merely a placeholder so I could see what printed. j in ngrams was also a placeholder to see what printed. Any input or ideas from anyone would be greatly appreciated!


Solution

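    import re

    raw_string = "fishies are super neat, a fish is a good pet. also, fishing is for dads."

    keys = ['fish', 'ishi', 'shie', 'hies']
    out = {}
    for key in keys:
        # re.escape treats the fragment literally; the lookahead also finds
        # overlapping matches, and DOTALL lets '.' match a newline.
        pattern = '(?=' + re.escape(key) + '(.))'
        followers = re.findall(pattern, raw_string, flags=re.DOTALL)
        # Count how often each following character occurs.
        out[key] = {ch: followers.count(ch) for ch in followers}

    print(out)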
    

    Output:

    {'fish': {'i': 2, ' ': 1}, 'ishi': {'e': 1, 'n': 1}, 'shie': {'s': 1}, 'hies': {' ': 1}}
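
    The regex approach above needs the list of fragments up front and scans the string once per fragment. Since the question already sets up a defaultdict(lambda: defaultdict(int)), here is a minimal sketch of a single-pass alternative that fills that structure for every fragment at once. The names follower_counts, text and most_common are made up for illustration and are not from the original code; it also assumes the text is already loaded into a plain string.

```python
from collections import defaultdict


def follower_counts(text, n):
    """Map each n-character fragment to counts of the character that follows it.

    Hypothetical helper for illustration; not part of the asker's class.
    """
    counts = defaultdict(lambda: defaultdict(int))
    # Stop n characters before the end so every fragment has a following character.
    for i in range(len(text) - n):
        fragment = text[i:i + n]
        counts[fragment][text[i + n]] += 1
    return counts


text = "fishies are super neat, a fish is a good pet. also, fishing is for dads."
counts = follower_counts(text, 4)

print(dict(counts["fish"]))  # {'i': 2, ' ': 1}

# Most common character after "fish": the key with the highest count.
most_common = max(counts["fish"], key=counts["fish"].get)
print(most_common)  # i
```

    Inside the class from the question, the same loop could presumably write into self.counts directly instead of returning a new dict.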