Tags: python, python-2.7, multiprocessing, pickle, pathos

Interaction between pathos.ProcessingPool and pickle


I have a list of calculations I need to run. I'm parallelizing them using

from pathos.multiprocessing import ProcessingPool
pool = ProcessingPool(nodes=7)
values = pool.map(helperFunction, someArgs)

helperFunction creates an object of a class called Parameters, which is defined in the same file as helperFunction itself:

import otherModule
class Parameters(otherModule.Parameters):
    ...

So far, so good. helperFunction does some calculations based on the Parameters object, changes some of its attributes, and finally stores them using pickle. Here's the relevant excerpt of the caching helper (from a different module) that does the saving:

import pickle
import hashlib
import os
class cacheHelper():

    def __init__(self, fileName, attr=[], folder='../cache/'):
        self.folder = folder

        # attrToName() is defined elsewhere in the class (not shown here)
        if len(attr) > 0:
            attr = self.attrToName(attr)
        else:
            attr = ''
        self.fileNameNaked = fileName
        self.fileName = fileName + attr

    def write(self, objects):
        # getFile() is defined elsewhere in the class (not shown here)
        with open(self.getFile(), 'wb') as output:
            for obj in objects:
                pickle.dump(obj, output, pickle.HIGHEST_PROTOCOL)

When it gets to pickle.dump(), it raises an exception that is hard to debug because the debugger won't step into the worker that actually hit it. So I set a breakpoint right before the dump and entered the command manually. Here is the output:

>>> pickle.dump(objects[0], output, pickle.HIGHEST_PROTOCOL)
Traceback (most recent call last):
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-4d2cbb7c63d1>", line 1, in <module>
    pickle.dump(objects[0], output, pickle.HIGHEST_PROTOCOL)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 1376, in dump
    Pickler(file, protocol).dump(obj)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 224, in dump
    self.save(obj)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 396, in save_reduce
    save(cls)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/site-packages/dill/dill.py", line 1203, in save_type
    StockPickler.save_global(pickler, obj)
  File "/usr/local/anaconda2/envs/myenv2/lib/python2.7/pickle.py", line 754, in save_global
    (obj, module, name))
PicklingError: Can't pickle <class '__main__.Parameters'>: it's not found as __main__.Parameters

The odd thing is that this doesn't happen when I don't parallelize, i.e. when I loop through helperFunction manually. I'm pretty sure I'm using the right Parameters class (and not the parent class).

I know it is tough to debug without a reproducible example, so I don't expect a solution to that part. Perhaps the more general question is:

What does one have to pay attention to when parallelizing code that uses pickle.dump() via another module?


Solution

Straight from the Python docs:

    12.1.4. What can be pickled and unpickled? The following types can be pickled:

    • None, True, and False
    • integers, floating point numbers, complex numbers
    • strings, bytes, bytearrays
    • tuples, lists, sets, and dictionaries containing only picklable objects
    • functions defined at the top level of a module (using def, not lambda)
    • built-in functions defined at the top level of a module
    • classes that are defined at the top level of a module
    • instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details)

Everything else can't be pickled. In your case, though it's hard to say for sure from the excerpt alone, I believe the problem is that the class Parameters is not defined at the top level of the module, so its instances can't be pickled, as the small sketch below illustrates.
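Here is a minimal sketch of that rule (the class names are made up for illustration): an instance of a top-level class pickles fine, while an instance of a class defined inside a function fails with a PicklingError much like the one in your traceback.

import pickle

class TopLevel(object):
    # Defined at module top level, so pickle can find it by name.
    pass

def make_nested_instance():
    # Defined inside a function, so pickle cannot look the class up as
    # <module>.Nested and refuses to serialize its instances.
    class Nested(object):
        pass
    return Nested()

pickle.dumps(TopLevel())                  # works
try:
    pickle.dumps(make_nested_instance())  # fails, like the traceback above
except (pickle.PicklingError, AttributeError) as err:
    print(err)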

The whole point of using pathos.multiprocessing (or its actively developed fork multiprocess) instead of the built-in multiprocessing is to avoid pickle, because there are far too many things the latter can't dump. pathos.multiprocessing and multiprocess use dill instead of pickle, and if you want to debug a worker you can use trace. A rough comparison of the two serializers is sketched below.
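As a rough illustration (my example, not part of the original answer), dill serializes objects the stock pickler rejects, such as a lambda:

import pickle
import dill

square = lambda x: x * x       # lambdas are not picklable with the stock pickler

try:
    pickle.dumps(square)
except pickle.PicklingError as err:
    print('pickle failed: %s' % err)

payload = dill.dumps(square)   # dill serializes the function object itself
restored = dill.loads(payload)
print(restored(4))             # -> 16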

NOTE: As Mike McKerns (the main contributor to multiprocess) rightly noted, there are cases that even dill can't handle, and it is hard to formulate universal rules on that matter. When in doubt, you can ask dill up front whether an object is serializable, as in the sketch below.
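One practical habit (again my addition, and the Parameters shown here is a hypothetical stand-in): before handing objects to the pool, check them with dill.pickles() and dill.detect.badobjects(), which often point at the offending attribute without having to decode a traceback from a worker.

import dill
import dill.detect

class Parameters(object):                   # hypothetical stand-in for the real class
    def __init__(self):
        self.data = [1, 2, 3]
        self.stream = (x * x for x in self.data)   # generators are one thing dill can't dump

p = Parameters()
print(dill.pickles(p))                      # False: something inside p is not serializable
print(dill.detect.badobjects(p, depth=1))   # maps attribute names to the objects that fail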