
Overridden __setitem__ call works in serial but breaks in apply_async call

I've been fighting with this problem for some time now and I've finally managed to narrow down the issue and create a minimum working example.

The summary of the problem is that I have a class that inherits from dict to facilitate parsing of misc. input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser objects through a multiprocessing apply_async call.
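
For context only, here is a stripped-down sketch of the kind of recursive indexing described above. It is not the actual parser: the class name DottedDict and its details are made up for illustration, and it assumes every key is a string.

```python
class DottedDict(dict):
    """Hypothetical illustration of dotted-key indexing; not the real parser."""

    def __init__(self):
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        # Split off the first section name; `rest` is empty for a plain key.
        head, _, rest = key.partition(self.section_delimiter)
        if not rest:
            dict.__setitem__(self, key, value)
            return
        if head not in self:
            dict.__setitem__(self, head, DottedDict())
        # Recurse into the nested section with the remainder of the key.
        self[head][rest] = value

    def __getitem__(self, key):
        head, _, rest = key.partition(self.section_delimiter)
        child = dict.__getitem__(self, head)
        return child[rest] if rest else child


parser = DottedDict()
parser["some.section.variable"] = 42
print(parser["some"]["section"]["variable"])
```

The important detail for the problem below is that __setitem__ reads self.section_delimiter on every call.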

Shown below is the minimum working example - obviously the __setitem__ call isn't doing anything special, but it's important that it accesses some instance attribute like self.section_delimiter - this is where it breaks. It doesn't break in the initial call or in the serial function call. But when you call some_function (which doesn't do anything either) using apply_async, it crashes.

import multiprocessing as mp

class Parser(dict):

    def __init__(self, file_name : str = None):
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        # Accessing an instance attribute here is what fails in the worker
        self.section_delimiter
        dict.__setitem__(self, key, value)

def some_function(parser):
    # Intentionally does nothing
    pass

if __name__ == "__main__":

    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1

    print("Single serial call works fine")
    some_function(parser)

    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))

If you run the code above, you'll get the following output

Initialize creation/setting
Single serial call works fine
Parallel async call breaks on line 16?
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/", line 297, in _bootstrap
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/", line 110, in worker
    task = get()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/", line 354, in get
    return _ForkingPickler.loads(res)
  File "", line 13, in __setitem__
AttributeError: 'Parser' object has no attribute 'section_delimiter'

Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love to not only fix it, but clearly fill some gap in my understanding on how these apply_async and inheritance/overridden methods interact.

Let me know if you need any more information.

Thank you very much!



  • Cause

    The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move its data across process boundaries. This is done using pickle. By default, pickle does not call __init__() when deserializing an object. Because of this, self.section_delimiter is not yet set when the deserializer calls __setitem__() to restore the items in your dictionary, and you get the error:

    AttributeError: 'Parser' object has no attribute 'section_delimiter'

    Using just pickle and no multiprocessing gives the same error:

    import pickle
    parser = Parser()
    parser['x'] = 1
    data = pickle.dumps(parser)
    copy = pickle.loads(data) # Same AttributeError here

    Deserialization will work for an object with no items and the value of section_delimiter will be restored:

    import pickle
    parser = Parser()
    parser.section_delimiter = "|"
    data = pickle.dumps(parser)
    copy = pickle.loads(data)
    print(copy.section_delimiter) # Prints "|"

    So in a sense you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
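
    The ordering can be observed directly with a small tracing class. This is only a sketch: the names Traced and events are invented for the demonstration, and __setstate__() is defined here purely to record when pickle restores the instance attributes.

```python
import pickle

events = []  # hypothetical log of what pickle does while loading


class Traced(dict):
    def __init__(self):
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        # Record whether the attribute exists at the time of the call.
        events.append(("setitem", key, hasattr(self, "section_delimiter")))
        dict.__setitem__(self, key, value)

    def __setstate__(self, state):
        # Called by pickle with the saved instance attributes.
        events.append(("setstate", sorted(state)))
        self.__dict__.update(state)


original = Traced()
original["x"] = 1
events.clear()  # only keep what happens during the round trip

restored = pickle.loads(pickle.dumps(original))
for event in events:
    print(event)
```

    The "setitem" entry is logged first and reports that section_delimiter does not exist yet; the "setstate" entry that restores it only comes afterwards.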


    Solution

    You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__():

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self
    def __getnewargs__(self):
        return (self.section_delimiter,)

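    __getnewargs__() returns a tuple of arguments that pickle passes to __new__() when it recreates the object. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__(). Keep in mind that __new__() also receives any positional arguments given to Parser(...), so with this signature a call like Parser("input.txt") would use the file name as the delimiter; adjust the argument handling if you construct the parser with a file name.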

    This is the code of your Parser class after the change:

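    class Parser(dict):
        def __init__(self, file_name : str = None):
            pass  # section_delimiter is now set in __new__()
        def __new__(cls, *args):
            self = super(Parser, cls).__new__(cls)
            # Set here so the attribute exists before pickle calls __setitem__()
            self.section_delimiter = args[0] if args else "."
            return self
        def __getnewargs__(self):
            # Arguments pickle passes to __new__() when deserializing
            return (self.section_delimiter,)
        def __setitem__(self, key, value):
            dict.__setitem__(self, key, value)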
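
    A quick way to check this is a plain pickle round trip, with no multiprocessing involved. The sketch below repeats the __new__()/__getnewargs__() approach in a self-contained form; the assert inside __setitem__() is only a stand-in for whatever the real method does with the delimiter.

```python
import pickle


class Parser(dict):
    def __new__(cls, *args):
        self = super().__new__(cls)
        # Set in __new__() so it exists before pickle restores the items.
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        # pickle passes these arguments to __new__() when loading.
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        # Stand-in for real logic that needs the delimiter.
        assert isinstance(self.section_delimiter, str)
        dict.__setitem__(self, key, value)


parser = Parser()
parser.section_delimiter = "|"
parser["x"] = 1

restored = pickle.loads(pickle.dumps(parser))
print(restored, restored.section_delimiter)
```

    Both the items and a non-default delimiter should survive the round trip, because the delimiter is handed to __new__() before any item is restored.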

    Simpler solution

    The reason pickle calls __setitem__() on your Parser object is that Parser is a dict subclass, so pickle restores its items one by one. If your Parser is instead a plain class that implements __setitem__() and __getitem__() and keeps an internal dictionary to back those calls, then pickle will not call __setitem__(), and serialization will work with no extra code:

    class Parser:
        def __init__(self, file_name : str = None):
            self.dict = { }
            self.section_delimiter = "."
        def __setitem__(self, key, value):
            self.dict[key] = value
        def __getitem__(self, key):
            return self.dict[key]

    So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.
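
    The same kind of round-trip check works for the composition variant. In this sketch the setitem_calls counter is a hypothetical addition, included only to show that pickle does not go through __setitem__() when loading.

```python
import pickle


class Parser:
    def __init__(self, file_name: str = None):
        self.dict = {}
        self.section_delimiter = "."
        self.setitem_calls = 0  # hypothetical counter, for the demo only

    def __setitem__(self, key, value):
        self.setitem_calls += 1
        self.dict[key] = value

    def __getitem__(self, key):
        return self.dict[key]


parser = Parser()
parser.section_delimiter = "|"
parser["x"] = 1

restored = pickle.loads(pickle.dumps(parser))
print(restored["x"], restored.section_delimiter, restored.setitem_calls)
```

    The counter keeps the value it had before pickling, since the whole instance state, including the inner dictionary, is restored in one step.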