
Overridden __setitem__ call works in serial but breaks in apply_async call


I've been fighting with this problem for some time now and I've finally managed to narrow down the issue and create a minimum working example.

The summary of the problem is that I have a class that inherits from dict to facilitate parsing of miscellaneous input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser objects through a multiprocessing.Pool.apply_async call.

Shown below is the minimal working example. Obviously the __setitem__ call isn't doing anything special, but it's important that it accesses some instance attribute like self.section_delimiter; this is where it breaks. It doesn't break in the initial call or in the serial function call, but when you call some_function (which doesn't do anything either) using apply_async, it crashes.

import multiprocessing as mp
import numpy as np

class Parser(dict):

    def __init__(self, file_name : str = None):
        print('\t__init__')
        super().__init__()
        self.section_delimiter = "."
    
    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)
           
def some_function(parser):
    pass

if __name__ == "__main__":

    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1

    print("Single serial call works fine")
    some_function(parser)

    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))

    pool.close()
    pool.join()

If you run the code above, you'll get the following output:

Initialize creation/setting
    __init__
    __setitem__
Single serial call works fine
Parallel async call breaks on line 16?
    __setitem__
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
  File "test_apply_async.py", line 13, in __setitem__
    self.section_delimiter
AttributeError: 'Parser' object has no attribute 'section_delimiter'

Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love to not only fix it, but also fill the gap in my understanding of how apply_async and inheritance/overridden methods interact.

Let me know if you need any more information.

Thank you very much!

Isaac


Solution

    Cause

    The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move it across process boundaries, and it does this with pickle. By default, pickle does not call __init__() when it deserializes an instance. For a dict subclass it first creates an empty object, then restores the dictionary items by calling __setitem__(), and only afterwards restores the instance attributes. So self.section_delimiter does not exist yet when the deserializer calls __setitem__(), and you get the error:

    AttributeError: 'Parser' object has no attribute 'section_delimiter'

    Using just pickle and no multiprocessing gives the same error:

    import pickle
    
    parser = Parser()
    parser['x'] = 1
    
    data = pickle.dumps(parser)
    copy = pickle.loads(data) # Same AttributeError here
    

    Deserialization will work for an object with no items and the value of section_delimiter will be restored:

    import pickle
    
    parser = Parser()
    parser.section_delimiter = "|"
    
    data = pickle.dumps(parser)
    copy = pickle.loads(data)
    
    print(copy.section_delimiter) # Prints "|"
    

    So in a sense you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
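The ordering described above can be observed directly. The following is a minimal sketch, assuming the default pickle protocol of Python 3 (protocol 2 or higher); the Recorder class and the events list are hypothetical names used only for this illustration, not part of the original code.

```python
import pickle

events = []  # records the order of calls made while unpickling


class Recorder(dict):
    """Hypothetical stand-in for Parser that logs what pickle calls."""

    def __init__(self):
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        # Note whether the instance attribute exists at this point.
        events.append(("__setitem__", hasattr(self, "section_delimiter")))
        dict.__setitem__(self, key, value)

    def __setstate__(self, state):
        # pickle calls this to restore the instance attributes.
        events.append(("__setstate__", sorted(state)))
        self.__dict__.update(state)


original = Recorder()
original["x"] = 1
events.clear()  # keep only the calls made during the round trip

copy = pickle.loads(pickle.dumps(original))
for event in events:
    print(event)
```

The first recorded event is the __setitem__() call, made while the attribute is still missing; the __setstate__() call that restores section_delimiter comes second.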

    Workaround

    You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__():

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self
    
    def __getnewargs__(self):
        return (self.section_delimiter,)
    

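    __getnewargs__() returns a tuple of arguments. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__(). Note that __new__() also receives whatever arguments are passed to Parser(...), so with this signature a positional file_name would be taken as the delimiter, and a keyword argument such as file_name=... would raise a TypeError. Adjust the signature of __new__() if you construct parsers with arguments.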

    This is the code of your Parser class after the change:

    class Parser(dict):
    
        def __init__(self, file_name : str = None):
            print('\t__init__')
            super().__init__()
    
        def __new__(cls, *args):
            self = super(Parser, cls).__new__(cls)
            self.section_delimiter = args[0] if args else "."
            return self
    
        def __getnewargs__(self):
            return (self.section_delimiter,)
     
        def __setitem__(self, key, value):
            print('\t__setitem__')
            self.section_delimiter
            dict.__setitem__(self, key, value)
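As a quick check of the workaround, here is a sketch of a pickle round trip. It is a variant of the class above rather than the exact code: it assumes you want file_name and the delimiter as separate parameters, and the section_delimiter keyword argument is an assumption introduced for this example.

```python
import pickle


class Parser(dict):

    def __new__(cls, file_name=None, section_delimiter="."):
        # Set the attribute before pickle starts calling __setitem__().
        self = super().__new__(cls)
        self.section_delimiter = section_delimiter
        return self

    def __init__(self, file_name=None, section_delimiter="."):
        super().__init__()

    def __getnewargs__(self):
        # pickle passes these positionally to __new__() when unpickling.
        return (None, self.section_delimiter)

    def __setitem__(self, key, value):
        self.section_delimiter  # stands in for the dotted-key logic
        dict.__setitem__(self, key, value)


parser = Parser("input.cfg", section_delimiter="|")
parser["x"] = 1

copy = pickle.loads(pickle.dumps(parser))  # no AttributeError
print(dict(copy), copy.section_delimiter)
```

Since multiprocessing uses pickle to transfer arguments, a class that survives this round trip should also pass through apply_async.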
    

    Simpler solution

    The reason pickle calls __setitem__() on your Parser object is that it is a dict subclass. If your Parser is instead a plain class that implements __setitem__() and __getitem__() and keeps a dictionary as an attribute to back those calls, then pickle will not call __setitem__(), and serialization will work with no extra code:

    class Parser:
    
        def __init__(self, file_name : str = None):
            print('\t__init__')
            self.dict = { }
            self.section_delimiter = "."
    
        def __setitem__(self, key, value):
            print('\t__setitem__')
            self.section_delimiter
            self.dict[key] = value
    
        def __getitem__(self, key):
            return self.dict[key]
    

    So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.
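If you drop the inheritance but still want most of the dict interface (get(), items(), update() and so on), one option is collections.abc.MutableMapping. This is a sketch under that assumption and not part of the class shown above; the _data attribute name is hypothetical.

```python
import pickle
from collections.abc import MutableMapping


class Parser(MutableMapping):

    def __init__(self, file_name=None):
        self._data = {}
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        self.section_delimiter  # stands in for the dotted-key logic
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


parser = Parser()
parser["x"] = 1

# pickle restores _data and section_delimiter as ordinary instance
# attributes, so __setitem__() is never called during loading.
copy = pickle.loads(pickle.dumps(parser))
print(dict(copy), copy.section_delimiter)
```

MutableMapping derives the remaining mapping methods from the five methods defined here, so code that only uses the mapping interface keeps working, while isinstance(parser, dict) becomes False.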