python pickle python-multiprocessing

multiprocess initializer and pickling


I've been playing around with multiprocessing.Pool and trying to understand how exactly the initializer argument works. From what I understand, the initializer function is called for each process, so I assumed the arguments to it (i.e. initargs) would have to be pickled across process boundaries. I know that the map method of a pool also uses pickling for its arguments, so I assumed that anything that works as an argument for the initializer should also work as an argument for mapping.

However, when I run the following piece of code, initialize gets called just fine, but then map throws an exception about not being able to pickle the module. (There's nothing special about using the current module as the argument; it was just the first non-pickleable object that came to mind.) Does anyone know what could be behind this difference?

from __future__ import print_function
import multiprocessing
import sys


def get_pid():
    return multiprocessing.current_process().pid


def initialize(module):
    print('Got module {} in PID {}'.format(module, get_pid()))


def worker(module):
    print('Got module {} in PID {}'.format(module, get_pid()))


current_module = sys.modules[__name__]
work = [current_module]

print('Main process has PID {}'.format(get_pid()))
pool = multiprocessing.Pool(None, initialize, work)
pool.map(worker, work)

Solution

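  • The initializer doesn't require pickling, but the map call does. When the pool forks its workers (as it does on Unix with Python 2.7, used here), each child inherits initargs from the parent's memory, so nothing is serialized. map, on the other hand, sends each task to an already-running worker through a queue, which pickles the function and its arguments. Maybe this will shed some light (here I'm using multiprocess instead of multiprocessing to get better pickling and interactivity).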

    >>> from __future__ import print_function
    >>> import multiprocess as multiprocessing
    >>> import sys
    >>> 
    >>> def get_pid():
    ...     return multiprocessing.current_process().pid
    ... 
    >>> 
    >>> def initialize(module):
    ...     print('Got module {} in PID {}'.format(module, get_pid()))
    ... 
    >>> 
    >>> def worker(module):
    ...     print('Got module {} in PID {}'.format(module, get_pid()))
    ... 
    >>> 
    >>> current_module = sys.modules[__name__]
    >>> work = [current_module]
    >>> 
    >>> print('Main process has PID {}'.format(get_pid()))
    Main process has PID 34866
    >>> pool = multiprocessing.dummy.Pool(None, initialize, work)
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    Got module <module '__main__' (built-in)> in PID 34866
    >>> pool.map(worker, work)
    Got module <module '__main__' (built-in)> in PID 34866
    [None]
    

    Cool. The thread pool works, because it doesn't need to pickle anything. What about when both worker and work have to be shipped using serialization?

    >>> pool = multiprocessing.Pool(None, initialize, work)
    Got module <module '__main__' (built-in)> in PID 34875
    Got module <module '__main__' (built-in)> in PID 34876
    Got module <module '__main__' (built-in)> in PID 34877
    Got module <module '__main__' (built-in)> in PID 34878
    Got module <module '__main__' (built-in)> in PID 34879
    Got module <module '__main__' (built-in)> in PID 34880
    Got module <module '__main__' (built-in)> in PID 34881
    Got module <module '__main__' (built-in)> in PID 34882
    >>> pool.map(worker, work)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 251, in map
        return self.map_async(func, iterable, chunksize).get()
      File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 567, in get
        raise self._value
    NotImplementedError: pool objects cannot be passed between processes or pickled
    >>> 
    

    So let's look at pickling work:

    >>> import pickle
    >>> import sys
    >>> pickle.dumps(sys.modules[__name__])
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 1374, in dumps
        Pickler(file, protocol).dump(obj)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
        self.save(obj)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
        rv = reduce(self.proto)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
        raise TypeError, "can't pickle %s objects" % base.__name__
    TypeError: can't pickle module objects
    >>> 
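
The same check can be wrapped in a small helper to compare the module object with something that does serialize, such as the module's name. This is a minimal sketch in Python 3 syntax; `can_pickle` is a hypothetical helper (not part of pickle or multiprocessing), and `sys` stands in for any module object:

```python
import pickle
import sys


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
    except Exception:  # pickle raises several different exception types
        return False
    return True


# A module's name is a plain string, so it serializes without trouble.
print(can_pickle(sys.__name__))

# The module object itself is rejected by the standard pickler.
print(can_pickle(sys))
```

Anything for which this helper returns False can't be sent to a worker through pool.map with the standard multiprocessing module.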
    

    So, you can't pickle a module… ok, can we do better using dill?

    >>> import dill
    >>> dill.detect.trace(True)
    >>> dill.pickles(work)
    M1: <module '__main__' (built-in)>
    F2: <function _import_module at 0x10c017cf8>
    # F2
    D2: <dict object at 0x10d9a8168>
    M2: <module 'dill' from '/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/__init__.pyc'>
    # M2
    F1: <function worker at 0x10c07fed8>
    F2: <function _create_function at 0x10c017488>
    # F2
    Co: <code object worker at 0x10b053cb0, file "<stdin>", line 1>
    F2: <function _unmarshal at 0x10c017320>
    # F2
    # Co
    D1: <dict object at 0x10af68168>
    # D1
    D2: <dict object at 0x10c0e4a28>
    # D2
    # F1
    M2: <module 'sys' (built-in)>
    # M2
    F1: <function initialize at 0x10c07fe60>
    Co: <code object initialize at 0x10b241f30, file "<stdin>", line 1>
    # Co
    D1: <dict object at 0x10af68168>
    # D1
    D2: <dict object at 0x10c0ea398>
    # D2
    # F1
    M2: <module 'pathos' from '/Users/mmckerns/lib/python2.7/site-packages/pathos-0.2a1.dev0-py2.7.egg/pathos/__init__.pyc'>
    # M2
    C2: __future__._Feature
    # C2
    D2: <dict object at 0x10b05b7f8>
    # D2
    M2: <module 'multiprocess' from '/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/__init__.pyc'>
    # M2
    T4: <class 'pathos.threading.ThreadPool'>
    # T4
    D2: <dict object at 0x10c0ea5c8>
    # D2
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 1209, in pickles
        pik = copy(obj, **kwds)
      File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 161, in copy
        return loads(dumps(obj, *args, **kwds))
      File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 197, in dumps
        dump(obj, file, protocol, byref, fmode, recurse)#, strictio)
      File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 190, in dump
        pik.dump(obj)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 224, in dump
        self.save(obj)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
        f(self, obj) # Call unbound method with explicit self
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 600, in save_list
        self._batch_appends(iter(obj))
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 636, in _batch_appends
        save(tmp[0])
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
        f(self, obj) # Call unbound method with explicit self
      File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 1116, in save_module
        state=_main_dict)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 419, in save_reduce
        save(state)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 286, in save
        f(self, obj) # Call unbound method with explicit self
      File "/Users/mmckerns/lib/python2.7/site-packages/dill-0.2.5.dev0-py2.7.egg/dill/dill.py", line 768, in save_module_dict
        StockPickler.save_dict(pickler, obj)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 649, in save_dict
        self._batch_setitems(obj.iteritems())
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 681, in _batch_setitems
        save(v)
      File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", line 306, in save
        rv = reduce(self.proto)
      File "/Users/mmckerns/lib/python2.7/site-packages/multiprocess-0.70.4.dev0-py2.7-macosx-10.8-x86_64.egg/multiprocess/pool.py", line 452, in __reduce__
        'pool objects cannot be passed between processes or pickled'
    NotImplementedError: pool objects cannot be passed between processes or pickled
    >>> 
    

    The answer is yes: the module starts to pickle, but then fails because of its contents. It looks like dill can pickle everything in __main__ except an instance of a pool; when __main__ holds one (here, the pool variable), pickling fails.
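
The effect of a pool sitting in the namespace can be reproduced with the standard library alone. In this rough sketch (Python 3 syntax), a plain dict stands in for the module's namespace, and a ThreadPool is used only because it shares the pool's refusal to be pickled without starting child processes; `pickling_error` is a hypothetical helper:

```python
import pickle
from multiprocessing.pool import ThreadPool


def pickling_error(obj):
    """Return the exception raised by pickle.dumps(obj), or None on success."""
    try:
        pickle.dumps(obj)
    except Exception as exc:
        return exc
    return None


pool = ThreadPool(1)  # threads only, so no child processes are started
try:
    # Stand-ins for a module namespace without and with a pool object in it.
    namespace_without_pool = {"answer": 42}
    namespace_with_pool = {"answer": 42, "pool": pool}

    print(repr(pickling_error(namespace_without_pool)))
    print(repr(pickling_error(namespace_with_pool)))
finally:
    pool.close()
    pool.join()
```

The first namespace serializes, while the second fails as soon as the pickler reaches the pool entry, which mirrors the last frames of the traceback above.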

    So, if you replace the last two lines of your code with this one, it will work, because the pool is never bound to a name in __main__:

    >>> multiprocessing.Pool(None, initialize, work).map(worker, work)
    Got module <module '__main__' (built-in)> in PID 34922
    Got module <module '__main__' (built-in)> in PID 34923
    Got module <module '__main__' (built-in)> in PID 34924
    Got module <module '__main__' (built-in)> in PID 34925
    Got module <module '__main__' (built-in)> in PID 34926
    Got module <module '__main__' (built-in)> in PID 34927
    Got module <module '__main__' (built-in)> in PID 34928
    Got module <module '__main__' (built-in)> in PID 34929
    Got module <module '__main__' (built-in)> in PID 34922
    [None]
    >>> 
    

    That's using multiprocess, which uses dill under the covers. pickle will still fail here, because it can't serialize a module. Serialization is needed because the objects have to be sent to another Python instance in another process.
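
If staying with the standard multiprocessing module, one possible way around this (an assumption on my part, not something tested in the session above) is to send the module's name, which is an ordinary string, and re-import the module inside the worker. A minimal sketch in Python 3 syntax, using the stdlib module `json` purely as an illustration:

```python
import importlib
import multiprocessing
import pickle


def worker(module_name):
    # Re-import the module by name inside the worker process and
    # return its name as proof that the import succeeded.
    module = importlib.import_module(module_name)
    return module.__name__


def main():
    work = ["json"]      # module names instead of module objects
    pickle.dumps(work)   # plain strings pickle without error
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, work))


if __name__ == "__main__":
    main()
```

The trade-off is that the worker gets its own imported copy of the module rather than the parent's live object, so any state the parent changed at runtime is not carried over.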