I'm working on a piece of code, and this is my first time implementing defaultdict from collections.
Currently, I have a piece of code that works just fine as a defaultdict, but I'd really like to nest my dictionaries.
This is the code I have currently:
    from collections import defaultdict, Counter
    from re import findall

    class Class:
        def __init__(self, n, file):
            self.counts = defaultdict(lambda: defaultdict(int))
            self.__n = n
            self.__file = file

        def function(self, starting_text, charas):
            self.__starting_text = starting_text
            self.__charas = charas
            with open(self.__file) as file:
                text = file.read().lower().replace('\n', ' ')
            ngrams = [text[i : i + self.__n] for i in range(len(text))]
            out = self.counts
            for item in ngrams:
                data = []
                for word in findall( item+".", text):
                    data.append(word[-1])
                self.counts = { item : data.count(item) for item in data }
                out[item] = self.counts
            self.counts = out
Some of the things in the code aren't yet implemented because I'm at a bit of a standstill, so please ignore anything that isn't applicable to this particular question!
If I run print(self.counts) at the end, my program prints something that looks like this:
defaultdict(<function Class.__init__.<locals>.<lambda> at 0x7f8c4a94bea0>, {'t': {'h': 1, ' ': 1, 'e': 1}, 'h': {'i': 1, 'o': 1}, 'i': {'s': 2}, 's': {' ': 2, 'h': 1, 'e': 1}, ' ': {'i': 1, 'a': 1, 's': 2}, 'a': {' ': 1}, 'o': {'r': 1}, 'r': {'t': 1}, 'e': {'n': 2, ' ': 1}, 'n': {'t': 1, 'c': 1}, 'c': {'e': 1}})
Which is great! But I'd really like those inner dictionaries to be defaultdicts as well. In particular, if I run self.counts['t']['h'], I get 1, as expected.
However, a benefit of the defaultdict is having it give you 0 if a key is not available. Currently, if I run self.counts['t']['x'] I get a KeyError, but I'd like to get 0 instead by having each inner dictionary be a defaultdict as well.
I'm assuming this can be done somewhere in the chunk of code beginning with out = self.counts, but I'm a little unsure how I can achieve this.
Use a collections.Counter or a defaultdict instead of a list for data (since you strictly don't care about the sequence of bits, only their occurrence counts), and update your self.counts[item] with those counts instead of assigning a plain dict to it:
    for item in set(ngrams):  # each distinct n-gram once, otherwise repeats get counted again
        data = self.counts[item]
        for word in findall(item + ".", text):
            data[word[-1]] += 1
And that's about it: this will update the relevant counts straight into the defaultdict as originally defined.
That aside, much of the code is... not ideal, or odd.
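To check the defaultdict fix above in isolation, here is a minimal sketch outside the class. The sample text is an assumption (it is inferred from the output shown in the question), n = 1 is assumed as well, and re.escape is an addition of mine so that n-grams containing regex metacharacters are matched literally:

```python
from collections import defaultdict
from re import escape, findall

# Assumed sample input, reconstructed from the question's printed output.
text = "this is a short sentence "
n = 1  # assumed n-gram length

counts = defaultdict(lambda: defaultdict(int))
ngrams = [text[i:i + n] for i in range(len(text))]

for item in set(ngrams):
    data = counts[item]  # inner defaultdict(int), created on first access
    # escape() keeps characters such as "." or "(" from being read as regex syntax
    for word in findall(escape(item) + ".", text):
        data[word[-1]] += 1

print(counts["t"]["h"])  # existing pair
print(counts["t"]["x"])  # missing pair: 0 instead of a KeyError
```

Keep in mind that merely looking up a missing key inserts it into the inner defaultdict with a value of 0, so lookups can grow the dictionaries.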
You're getting each codepoint following your ngram, why not straight extract pseudo-ngrams of n+1 and split that up? Something along the lines of (untested, may be slightly off):
    for i in range(0, len(text)-n):
        ngram, follower = text[i:i+n], text[i+n]
        self.counts[ngram][follower] += 1
That also avoids the at-least-quadratic complexity of your code (and the various constant-factor costs), which is a nice side effect. Note, though, that the original would implicitly skip followers that are \n (a newline / line break): without re.DOTALL, . "matches any character except a newline". So if you want to keep that behaviour, you'll have to specifically test for and skip follower == '\n'.
You're reusing self.counts as a local variable for some weird reason: saving it to out, setting it to weird stuff, then re-loading it after having set it on itself. Why isn't out the inner variable?
    for item in ngrams:
        data = []
        for word in findall( item+".", text):
            data.append(word[-1])
        out = { item : data.count(item) for item in data }
        self.counts[item] = out
Not that that's very useful (possibly aside from printf-debugging utility); you can assign to self.counts[item] directly.
I also have no idea whatsoever what utility __starting_text and __charas have.
Don't use double-underscore attribute names. It doesn't matter what you're using them for; I'm reasonably sure you're wrong (because I've rarely encountered people who knew what these are for), and you should stop it.
If you want to hint to callers that something is an internal detail of the object, use a single underscore prefix. Though you probably don't need to do that either.
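For reference, a small sketch of what the double underscore actually does: it triggers name mangling, which rewrites the attribute name per class. The class names Model and SubModel are hypothetical and exist only for this illustration:

```python
class Model:
    def __init__(self, n):
        self.__n = n    # mangled: actually stored as _Model__n
        self._size = n  # single underscore: a convention only, no mangling


class SubModel(Model):
    def peek(self):
        # Inside SubModel this is mangled to _SubModel__n, which was never set.
        return self.__n


m = SubModel(3)
print(vars(m))  # {'_Model__n': 3, '_size': 3}

error = None
try:
    m.peek()
except AttributeError as exc:
    error = exc
print(type(error).__name__)  # AttributeError
```

Mangling exists to avoid accidental name clashes between a class and its subclasses, not to make attributes private.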
Pass an explicit encoding to open. Seriously. open(path) works in text mode (it automatically decodes the raw on-disk data to str), but the encoding it picks is whatever locale.getpreferredencoding(False) returns, which is as likely as not to be garbage. You don't want to use it to read the user's file, and you really absolutely never want to use it to read your own file. Explicitly provide encoding='utf-8'; it will avoid lots of grief down the line. If you need to infer the encoding, maybe use chardet.
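A minimal sketch of reading with an explicit encoding. The file name and sample contents are made up for the example, and a temporary directory is used so the snippet runs anywhere:

```python
import locale
import tempfile
from pathlib import Path

sample = "naïve café\n"  # made-up contents with non-ASCII characters

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"  # hypothetical file name
    path.write_text(sample, encoding="utf-8")

    # Explicit encoding: the result does not depend on the machine's locale.
    with open(path, encoding="utf-8") as f:
        text = f.read()

# What open() would have fallen back to without the encoding argument.
print(locale.getpreferredencoding(False))
```

On a machine where the printed locale encoding is not UTF-8, omitting the argument could decode the same bytes into different characters or raise UnicodeDecodeError.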