I've been playing around with multiprocessing.Pool and trying to understand how exactly the initializer argument works. From what I understand, the initializer function is called for each process, so I assumed the arguments to it (i.e. initargs) would have to be pickled across process boundaries. I know that the map method of a pool also uses pickling for its arguments, so I assumed that anything that works as an argument for the initializer should also work as an argument for mapping.
However, when I run the following piece of code, initialize gets called just fine, but then map throws an exception about not being able to pickle the module. (There's nothing special about using the current module as the argument; it was just the first non-pickleable object that came to mind.) Does anyone know what could be behind this difference?
from __future__ import print_function
import multiprocessing
import sys

def get_pid():
    return multiprocessing.current_process().pid

def initialize(module):
    print('Got module {} in PID {}'.format(module, get_pid()))

def worker(module):
    print('Got module {} in PID {}'.format(module, get_pid()))

current_module = sys.modules[__name__]
work = [current_module]

print('Main process has PID {}'.format(get_pid()))
pool = multiprocessing.Pool(None, initialize, work)
pool.map(worker, work)
Initialize doesn't require pickling, but the map call does. Maybe this will shed some light… (here I'm using multiprocess instead of multiprocessing to give better pickling and interactivity).
>>> from __future__ import print_function
>>> import multiprocess as multiprocessing
>>> import sys
>>>
>>> def get_pid():
...     return multiprocessing.current_process().pid
...
>>>
>>> def initialize(module):
...     print('Got module {} in PID {}'.format(module, get_pid()))
...
>>>
>>> def worker(module):
...     print('Got module {} in PID {}'.format(module, get_pid()))
...
>>>
>>> current_module = sys.modules[__name__]
>>> work = [current_module]
>>>
>>> print('Main process has PID {}'.format(get_pid()))
Main process has PID 34866
>>> pool = multiprocessing.dummy.Pool(None, initialize, work)
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
Got module <module '__main__' (built-in)> in PID 34866
>>> pool.map(worker, work)
Got module <module '__main__' (built-in)> in PID 34866
[None]
Cool. The threading pool works… (because it doesn't need to pickle anything). How about when we ship both worker and work across process boundaries using serialization?
>>> pool = multiprocessing.Pool(None, initialize, work)
Got module <module '__main__' (built-in)> in PID 34875
Got module <module '__main__' (built-in)> in PID 34876
Got module <module '__main__' (built-in)> in PID 34877
Got module <module '__main__' (built-in)> in PID 34878
Got module <module '__main__' (built-in)> in PID 34879
Got module <module '__main__' (built-in)> in PID 34880
Got module <module '__main__' (built-in)> in PID 34881
Got module <module '__main__' (built-in)> in PID 34882
>>> pool.map(worker, work)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 251, in map
return self.map_async(func, iterable, chunksize).get()
File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 567, in get
raise self._value
NotImplementedError: pool objects cannot be passed between processes or pickled
>>>
So let's look at pickling work:
>>> import pickle
>>> import sys
>>> pickle.dumps(sys.modules[__name__])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1374, in dumps
Pickler(file, protocol).dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle module objects
>>>
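This also squares with why the initializer accepted the module in the first place: under the fork start method, initargs reach the worker processes by inheritance rather than by pickling. Here's a minimal sketch with the stock multiprocessing, using a threading.Lock as another stand-in unpicklable object (this assumes a fork-capable platform such as Linux; under spawn, initargs are pickled and this would fail too):

```python
import multiprocessing
import threading

def initialize(lock):
    # Runs once in each worker; under fork, `lock` arrives by process
    # inheritance, so it never needs to be pickled.
    print('worker initialized with a', type(lock).__name__)

if __name__ == '__main__':
    # Force the fork start method explicitly; under spawn, initargs
    # *are* pickled and this unpicklable lock would fail just like
    # the map arguments do.
    ctx = multiprocessing.get_context('fork')
    lock = threading.Lock()  # unpicklable, like a module object
    pool = ctx.Pool(2, initialize, (lock,))
    pool.close()
    pool.join()
    print('initializer accepted an unpicklable argument')
```

Passing the same lock through pool.map would raise a pickling error, because map arguments always travel through the pickled task queue.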
So, you can't pickle a module… ok, can we do better using dill?
>>> import dill
>>> dill.detect.trace(True)
>>> dill.pickles(work)
M1: <module '__main__' (built-in)>
F2: <function _import_module at 0x10c017cf8>
# F2
D2: <dict object at 0x10d9a8168>
M2: <module 'dill' from '/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/__init__.pyc'>
# M2
F1: <function worker at 0x10c07fed8>
F2: <function _create_function at 0x10c017488>
# F2
Co: <code object worker at 0x10b053cb0, file "<stdin>", line 1>
F2: <function _unmarshal at 0x10c017320>
# F2
# Co
D1: <dict object at 0x10af68168>
# D1
D2: <dict object at 0x10c0e4a28>
# D2
# F1
M2: <module 'sys' (built-in)>
# M2
F1: <function initialize at 0x10c07fe60>
Co: <code object initialize at 0x10b241f30, file "<stdin>", line 1>
# Co
D1: <dict object at 0x10af68168>
# D1
D2: <dict object at 0x10c0ea398>
# D2
# F1
M2: <module 'pathos' from '/Users/mmckerns/lib/python2.7/site-packages/pathos-0.2a1.dev0-py2.7.egg/pathos/__init__.pyc'>
# M2
C2: __future__._Feature
# C2
D2: <dict object at 0x10b05b7f8>
# D2
M2: <module 'multiprocess' from '/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/__init__.pyc'>
# M2
T4: <class 'pathos.threading.ThreadPool'>
# T4
D2: <dict object at 0x10c0ea5c8>
# D2
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 1209, in pickles
pik = copy(obj, **kwds)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 161, in copy
return loads(dumps(obj, *args, **kwds))
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 197, in dumps
dump(obj, file, protocol, byref, fmode, recurse)#, strictio)
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 190, in dump
pik.dump(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
self.save(obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
self._batch_appends(iter(obj))
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 636, in _batch_appends
save(tmp[0])
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 1116, in save_module
state=_main_dict)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
save(state)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 768, in save_module_dict
StockPickler.save_dict(pickler, obj)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
save(v)
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
rv = reduce(self.proto)
File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 452, in __reduce__
'pool objects cannot be passed between processes or pickled'
NotImplementedError: pool objects cannot be passed between processes or pickled
>>>
The answer is yes -- the module starts to pickle, but fails due to the contents of the module… so it looks like it works for everything in __main__ except when there's an instance of a pool in __main__ -- then it will fail.
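This failure-through-contents behaviour isn't specific to dill or to modules; plain pickle propagates the same way for any container, as a quick sketch shows (the dict here is just an illustrative example):

```python
import os
import pickle

data = {'answer': 42, 'names': ['a', 'b']}
# Every element pickles, so the whole dict round-trips fine.
roundtripped = pickle.loads(pickle.dumps(data))

# File objects are a classic unpicklable value; one bad element
# poisons the whole container.
data['handle'] = open(os.devnull)
error = None
try:
    pickle.dumps(data)
except TypeError as exc:
    error = exc
print('failed because of one element:', error)
```

A pool instance sitting in the __main__ namespace plays exactly the role of the file handle above.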
So, if the last two lines of your code are replaced with this one line, it works:
>>> multiprocessing.Pool(None, initialize, work).map(worker, work)
Got module <module '__main__' (built-in)> in PID 34922
Got module <module '__main__' (built-in)> in PID 34923
Got module <module '__main__' (built-in)> in PID 34924
Got module <module '__main__' (built-in)> in PID 34925
Got module <module '__main__' (built-in)> in PID 34926
Got module <module '__main__' (built-in)> in PID 34927
Got module <module '__main__' (built-in)> in PID 34928
Got module <module '__main__' (built-in)> in PID 34929
Got module <module '__main__' (built-in)> in PID 34922
[None]
>>>
That's using multiprocess, which uses dill under the covers. Plain pickle will still fail here, because pickle can't serialize a module. Serialization is needed because the objects have to be sent to another Python instance in another process.
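If switching serializers isn't an option, a standard workaround (not part of the answer above, just a common trick) is to send the module's *name* instead of the module, and re-import it on the receiving side -- strings always pickle:

```python
import importlib
import json
import pickle

# The module object itself won't pickle...
error = None
try:
    pickle.dumps(json)
except TypeError as exc:
    error = exc
print('as expected:', error)

# ...but its *name* is a plain string, which pickles fine; the
# receiving process can re-import the module from the name.
name = pickle.loads(pickle.dumps('json'))
module = importlib.import_module(name)
print(module.__name__)  # -> json
```

The same idea applies to pool.map: map over module names and call importlib.import_module inside the worker function.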