Nesting defaultdicts in Python 3


I'm working on a piece of code and this is my first time implementing defaultdict from collections. Currently, I have a piece of code that works just fine as a defaultdict, but I'd really like to nest my dictionaries.

This is the code I have currently:

from collections import defaultdict, Counter
from re import findall

class Class: 
    def __init__(self, n, file):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.__n = n
        self.__file = file

    def function(self, starting_text, charas):
        self.__starting_text = starting_text
        self.__charas = charas

        with open(self.__file) as file:
            text = file.read().lower().replace('\n', ' ')

        ngrams = [text[i : i + self.__n] for i in range(len(text))]

        out = self.counts
        for item in ngrams:
            data = []
            for word in findall( item+".", text):
                data.append(word[-1])
            self.counts = { item : data.count(item) for item in data }
            out[item] = self.counts
        self.counts = out

Some of the things in the code aren't yet implemented because I'm at a bit of a standstill, so please ignore anything that isn't applicable to this particular question!

If I run print(self.counts) at the end, my program prints something that looks like this:

defaultdict(<function Class.__init__.<locals>.<lambda> at 0x7f8c4a94bea0>, {'t': {'h': 1, ' ': 1, 'e': 1}, 'h': {'i': 1, 'o': 1}, 'i': {'s': 2}, 's': {' ': 2, 'h': 1, 'e': 1}, ' ': {'i': 1, 'a': 1, 's': 2}, 'a': {' ': 1}, 'o': {'r': 1}, 'r': {'t': 1}, 'e': {'n': 2, ' ': 1}, 'n': {'t': 1, 'c': 1}, 'c': {'e': 1}})

Which is great! But I'd really like those inner dictionaries to be defaultdicts as well. In particular, if I run self.counts['t']['h'], I get 1, as expected. However, a benefit of defaultdict is that it gives you 0 if a key is not available. Currently, if I run self.counts['t']['x'] I get a KeyError, but I'd like to get 0 instead by having each inner dictionary be a defaultdict as well.

I'm assuming this can be done somewhere in the chunk of code beginning with out=self.counts, but I'm a little unsure how I can achieve this.


Solution

    1. Make data a collections.Counter or a defaultdict(int) instead of a list: you don't care about the order of the following characters, only how often each one occurs.
    2. Then update self.counts[item] with that counter instead of assigning a plain dict to it.
    3. You could even update the counts in place:
      for item in set(ngrams):  # each distinct n-gram once, otherwise repeats get counted again
          data = self.counts[item]
          for word in findall( item+".", text):
              data[word[-1]] += 1
      
      That's about it: this writes the counts straight into the nested defaultdict as originally defined, so self.counts['t']['x'] gives 0 instead of raising KeyError.
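As a self-contained sketch of the in-place update, assuming a short in-memory string instead of a file (the sample text and n = 1 are chosen here for illustration, and re.escape is an addition so that n-grams containing regex metacharacters are matched literally):

```python
from collections import defaultdict
from re import escape, findall

text = "this is a short sentence "  # sample text, assumed for illustration
n = 1

counts = defaultdict(lambda: defaultdict(int))

# Visit each distinct n-gram once; findall already finds every occurrence.
ngrams = {text[i:i + n] for i in range(len(text))}

for item in ngrams:
    data = counts[item]  # inner defaultdict(int), created on first access
    for word in findall(escape(item) + ".", text):
        data[word[-1]] += 1

print(counts["t"]["h"])  # a follower that was seen
print(counts["t"]["x"])  # a follower that was never seen: 0, no KeyError
```

Note that looking up a missing key in a defaultdict also inserts it, so counts["t"] gains an "x" entry with value 0 after the last line.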

    That aside, much of the code is... not ideal, or odd.

    seems unnecessarily complex

    You're collecting each codepoint that follows your n-gram, so why not extract windows of n+1 characters directly and split each one into the n-gram and its follower? Something along the lines of (untested, may be slightly off):

    n = self.__n  # n-gram length set in __init__
    for i in range(len(text) - n):
        ngram, follower = text[i:i+n], text[i+n]
        self.counts[ngram][follower] += 1
    

    That also avoids the at-least-quadratic complexity of your code (and its various constant overheads), which is a nice side effect. Note, though, that the original implicitly skips followers that are \n (a newline / line break), because without re.DOTALL, . "matches any character except a newline". If you want to keep that behaviour, you'll have to test for follower == '\n' and skip it explicitly. (In the code as posted, newlines have already been replaced with spaces, so this only matters if you drop that replace.)
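A runnable sketch of the sliding-window idea, written as a standalone function rather than a method (the name follower_counts and the sample text are made up for illustration). Unlike findall, this also counts overlapping occurrences of an n-gram:

```python
from collections import defaultdict


def follower_counts(text, n):
    """Count which character follows each n-character window of text."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(len(text) - n):
        ngram, follower = text[i:i + n], text[i + n]
        if follower == "\n":
            # Mirror the regex '.' behaviour, which never matches a newline.
            continue
        counts[ngram][follower] += 1
    return counts


counts = follower_counts("this is a short sentence ", 1)
print(dict(counts["t"]))  # followers of 't' and how often each occurs
print(counts["t"]["x"])   # unseen follower: 0 instead of KeyError
```

This makes a single pass over the text, so the work grows linearly with its length.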

    Reusing member variables as locals?

    You're reusing self.counts as a local variable for some weird reason: you save it to out, point it at a temporary dict, then restore it after storing that dict inside itself. Why isn't out the temporary variable?

        for item in ngrams:
            data = []
            for word in findall( item+".", text):
                data.append(word[-1])
            # 'ch' avoids reusing the name of the loop variable 'item'
            out = { ch : data.count(ch) for ch in data }
            self.counts[item] = out
    

    Not that out is very useful here (aside, possibly, from printf-debugging): you can assign to self.counts[item] directly.

    I also have no idea whatsoever what purpose __starting_text and __charas serve.

    double underscore prefixes

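    Don't. A double underscore prefix triggers name mangling (inside Class, self.__n is actually stored as self._Class__n), which exists to avoid attribute clashes with subclasses, not to make anything private. It doesn't matter what you're using them for, I'm reasonably sure you're wrong (because I've rarely encountered people who knew what these are for) and you should stop it.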

    If you want to hint to callers that something is an internal detail of the object, use a single underscore prefix. Though you probably don't need to do that either.
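To see what the double underscore actually does, here is a small sketch using a made-up class named Model: Python rewrites __n inside the class body to _Model__n (name mangling), while a single underscore is left alone and is only a naming convention.

```python
class Model:
    def __init__(self, n):
        self.__n = n  # name-mangled: stored on the instance as _Model__n
        self._n = n   # plain attribute; the underscore is only a hint


m = Model(3)
print(m._n)               # the single-underscore attribute is reachable as written
print(m._Model__n)        # the "hidden" attribute is reachable under its mangled name
print(hasattr(m, "__n"))  # False: no attribute is literally called __n
```

So the double underscore does not hide anything; it only renames the attribute.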

    always pass an encoding to text-mode open

    Seriously. open(path) works in text mode (it automatically decodes the raw on-disk data to str), but the encoding it picks is whatever locale.getpreferredencoding(False) returns, which depends on the platform and locale settings and is as likely as not to be garbage. You don't want to use it to read the user's file, and you absolutely never want to use it to read your own file. Explicitly pass encoding='utf-8'; it will avoid lots of grief down the line. If you need to infer the encoding, maybe use chardet.
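A minimal sketch of reading with an explicit encoding. The temporary file and its contents are invented here so the example is self-contained; in the question's code the only change would be adding encoding='utf-8' to the existing open call.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical input file
    path.write_text("Café\nnaïve", encoding="utf-8")

    # Same read as in the question, but with the encoding stated explicitly.
    with open(path, encoding="utf-8") as f:
        text = f.read().lower().replace("\n", " ")

print(text)
```

With the encoding spelled out, the result no longer depends on the locale of the machine running the script.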