Search code examples
pythondictionaryautovivification

In python, how does the following AutoVivification class work?


In searching for a way of working with nested dictionaries, I found the following code posted by nosklo, which I would like to have explained, please.

class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value

Testing:

a = AutoVivification()

a[1][2][3] = 4
a[1][3][3] = 5
a[1][2]['test'] = 6

print a

Output:

{1: {2: {'test': 6, 3: 4}, 3: {3: 5}}}

I'm a pretty newbie programmer. I have learned most of what I know on my own time on the side, with my only formal training being on Turbo Pascal back in high school. I understand and am able to use classes in simple ways, such as using __init__, class methods, and storing data within instances of the class with foo.man = 'choo'.

I have no idea how the series of square brackets get directed, correctly, through the class (I presume they are calling __getitem__ somehow) and don't understand how they each get handled so concisely without having to call the method three times individually.

I was under the impression that the (dict) in the class declaration would be handled by an __init__.

I've used try: except: before, though again, in quite simple ways. It looks to me like the try, when it runs, is calling a series of function __getitem__. I gather that if the current level's dictionary exists, the try will pass and go to the next dictionary. The except, I gather, runs when there's a KeyError but I haven't seen self used like that before. Self's being treated like a dictionary while I thought self was an instance of class AutoVivification... is it both? I have never assigned twice in a row like this foo = man = choo but suspect that value is pointing to self[item] while self[item] points to the result of type(self). But type(self) would return something like this: <class '__main__.AutoVivification'> wouldn't it? I have no idea what the extra round brackets at the end there are for. Because I don't know how the function is being called, I don't understand where value is being returned.

Sorry for all the questions! There is so much in this that I don't understand and I don't know where to look it up short of reading through the documentation for hours in which I'd retain very little. This code looks like it'll serve my purposes but I want to understand it before using it.

In case you want to know what I'm trying to do in my program with nested dictionaries: I'm trying to hold map data on an astronomical scale. While I can't create dictionaries/lists of 10^6 items nested 4 times (that would be 10^24 items!), the space is mostly empty so I can leave the empty values out completely and only assign when there's something there. What was stumping me was an efficient way of handling the dictionaries.


Solution

  • Line by line:

    class AutoVivification(dict):
    

    We make a subclass of dict, so AutoVivification is a kind of dict, with some local changes.

    def __getitem__(self, item):
    

    The __getitem()__ hook is called whenever someone tries to access an item on the instance through [...] index lookups. So whenever someone does object[somekey], type(object).__getitem__(object, somekey) is called.

    We'll skip the try for a moment, next line is:

     return dict.__getitem__(self, item)
    

    This calls the unbound method __getitem__(), and passes in our own instance to it, together with the key. In other words, we call the original __getitem__ as defined by our parent class dict.

    Now, we all know what happens if there is no item key in a dictionary, a KeyError is raised. This is where the try:, except KeyError combo comes in:

        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
    

    So, if the current instance (which is a sub-type of dict) doesn't have a given key, it'll catch the KeyError exception the original dict.__getitem__() method throws, and instead we create a new value, store that in self[item] and return that value.

    Now, remember that self is a (subclass) of dict, so it's a dictionary. It thus can assign new values (for which it'll use the __setitem__ hook, incidentially), and in this case it creates a new instance of the same type as self. That's another dict subclass.

    So what happens in detail when we call a[1][2][3] = 4? Python goes through this step by step:

    1. a[1] leads to type(a).__getitem__(a, 1). The custom __getitem__ method of AutoVivification catches the KeyError, creates a new instance of AutoVivification, stores that under the key 1 and returns it.

    2. a[1] returned an empty AutoVivification instance. The next item access [2] is called on that object, and we repeat what happened in step 1; there is a KeyError, a new instance of AutoVivification is created, stored under the 2 key, and that new instance is returned to the caller.

    3. a[1][2] returned an empty AutoVivification instance. The next item access [3] is called on that object, and we repeat what happened in step 1 (and in step 2). There is a KeyError, a new instance of AutoVivification is created, stored under the 3 key, and that new instance is returned to the caller.

    4. a[1][2][3] returned an empty AutoVivification instance. Now we store a new value in that instance, 4.

    Once you go to your next line of code, a[1][3][3] = 5, the top-level AutoVivification instance already has a 1 key, and the return dict.__getitem__(self, item) line will return the corresponding value, which happens to be the AutoVivification instance created in step one above.

    From there, the [3] item access call will create a new AutoVivification instance again (because the object at a[1] only has a 2 key), and we go through all the same steps again.