I've been fighting with this problem for some time now and I've finally managed to narrow down the issue and create a minimum working example. The summary of the problem is that I have a class that inherits from a dict to facilitate parsing of misc. input files. I've overridden the __setitem__ call to support recursive indexing of sections in our input file (e.g. parser['some.section.variable'] is equivalent to parser['some']['section']['variable']). This has been working great for us for over a year now, but we just ran into an issue when passing these Parser objects through a multiprocessing.apply_async call.
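For context, the recursive indexing in the real class looks roughly like this (a simplified sketch, not our actual production code):

def __setitem__(self, key, value):
    # Split 'some.section.variable' into 'some' and 'section.variable'
    # and recurse into (or create) the nested section.
    if self.section_delimiter in key:
        section, rest = key.split(self.section_delimiter, 1)
        self.setdefault(section, Parser())[rest] = value
    else:
        dict.__setitem__(self, key, value)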
Shown below is the minimum working example. Obviously the __setitem__ call isn't doing anything special here, but it's important that it accesses some class attribute like self.section_delimiter; that's where it breaks. It doesn't break in the initial call or in the serial function call. But when you call some_function (which doesn't do anything either) via apply_async, it crashes.
import multiprocessing as mp
import numpy as np

class Parser(dict):

    def __init__(self, file_name: str = None):
        print('\t__init__')
        super().__init__()
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)

def some_function(parser):
    pass

if __name__ == "__main__":
    print("Initialize creation/setting")
    parser = Parser()
    parser['x'] = 1

    print("Single serial call works fine")
    some_function(parser)

    print("Parallel async call breaks on line 16?")
    pool = mp.Pool(1)
    for i in range(1):
        pool.apply_async(some_function, (parser,))
    pool.close()
    pool.join()
If you run the code above, you'll get the following output:
Initialize creation/setting
	__init__
	__setitem__
Single serial call works fine
Parallel async call breaks on line 16?
	__setitem__
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/pool.py", line 110, in worker
    task = get()
  File "/home/ijw/miniconda3/lib/python3.7/multiprocessing/queues.py", line 354, in get
    return _ForkingPickler.loads(res)
  File "test_apply_async.py", line 13, in __setitem__
    self.section_delimiter
AttributeError: 'Parser' object has no attribute 'section_delimiter'
Any help is greatly appreciated. I spent considerable time tracking down this bug and reproducing a minimal example. I would love not only to fix it, but also to fill the gap in my understanding of how apply_async and inherited/overridden methods interact.
Let me know if you need any more information.
Thank you very much!
Isaac
The cause of the problem is that multiprocessing serializes and deserializes your Parser object to move its data across process boundaries. This is done using pickle. By default, pickle does not call __init__() when deserializing classes. Because of this, self.section_delimiter is not set when the deserializer calls __setitem__() to restore the items in your dictionary, and you get the error:

AttributeError: 'Parser' object has no attribute 'section_delimiter'
Using just pickle and no multiprocessing gives the same error:
import pickle
parser = Parser()
parser['x'] = 1
data = pickle.dumps(parser)
copy = pickle.loads(data) # Same AttributeError here
Deserialization will work for an object with no items, and the value of section_delimiter will be restored:
import pickle
parser = Parser()
parser.section_delimiter = "|"
data = pickle.dumps(parser)
copy = pickle.loads(data)
print(copy.section_delimiter) # Prints "|"
So, in a sense, you are just unlucky that pickle calls __setitem__() before it restores the rest of the state of your Parser.
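You can actually see that ordering in the pickle stream itself: pickletools will show the SETITEM opcode (which calls __setitem__() during loading) before the BUILD opcode (which restores the instance attributes). A quick way to inspect it, assuming the original Parser class from the question is defined:

import pickle
import pickletools

parser = Parser()
parser['x'] = 1

# Disassemble the pickle stream. SETITEM (restores dict items via
# __setitem__) appears before BUILD (restores attributes such as
# section_delimiter), so __setitem__ runs on a half-restored object.
pickletools.dis(pickle.dumps(parser))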
You can work around this by setting section_delimiter in __new__() and telling pickle what arguments to pass to __new__() by implementing __getnewargs__():
def __new__(cls, *args):
    self = super(Parser, cls).__new__(cls)
    self.section_delimiter = args[0] if args else "."
    return self

def __getnewargs__(self):
    return (self.section_delimiter,)
__getnewargs__() returns a tuple of arguments. Because section_delimiter is set in __new__(), it is no longer necessary to set it in __init__().
This is the code of your Parser class after the change:
class Parser(dict):

    def __init__(self, file_name: str = None):
        print('\t__init__')
        super().__init__()

    def __new__(cls, *args):
        self = super(Parser, cls).__new__(cls)
        self.section_delimiter = args[0] if args else "."
        return self

    def __getnewargs__(self):
        return (self.section_delimiter,)

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        dict.__setitem__(self, key, value)
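With this change, the pickle round trip from before now succeeds; here is a quick check, assuming the class above:

import pickle

parser = Parser()
parser['x'] = 1

# __new__ is called with the arguments from __getnewargs__, so
# section_delimiter exists before __setitem__ restores the items.
copy = pickle.loads(pickle.dumps(parser))
print(copy['x'])               # Prints 1
print(copy.section_delimiter)  # Prints "."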
The reason pickle calls __setitem__() on your Parser object is that it is a dictionary. If your Parser is just a class that happens to implement __setitem__() and __getitem__(), and has a dictionary to implement those calls, then pickle will not call __setitem__() and serialization will work with no extra code:
class Parser:

    def __init__(self, file_name: str = None):
        print('\t__init__')
        self.dict = {}
        self.section_delimiter = "."

    def __setitem__(self, key, value):
        print('\t__setitem__')
        self.section_delimiter
        self.dict[key] = value

    def __getitem__(self, key):
        return self.dict[key]
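And indeed, a pickle round trip now works with no extra code, because pickle restores self.dict and self.section_delimiter as ordinary instance attributes without ever calling __setitem__() (a quick check, assuming the class above):

import pickle

parser = Parser()
parser['x'] = 1

copy = pickle.loads(pickle.dumps(parser))  # no __setitem__ call during load
print(copy['x'])  # Prints 1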
So if there is no other reason for your Parser to be a dictionary, I would just not use inheritance here.