Tags: python-3.x, multiprocessing, python-multiprocessing

Initializer method pattern in Python's multiprocessing


Consider the following example:

from multiprocessing import get_context

class A:
    def __init__(self):
        self.prop = 1
    def initializer(self):
        print("initializer", id(self))
        self.prop = 2
    def job(self, i):
        print("job", id(self), i, self.prop)

a = A()
ctx = get_context("fork")  # same happens with "spawn"
with ctx.Pool(1, a.initializer) as p:
    for _ in p.map(a.job, range(5)):
        pass

This produces the output:

initializer 4382254080
job 4383049472 0 1
job 4383049472 1 1
job 4383048224 2 1
job 4383048224 3 1
job 4383049472 4 1

We can see that the initializer runs on a different copy of a than the jobs use (its self.prop = 2 never shows up), and even worse, id(self) is not constant across jobs even though I only use one child process here.

My question: I'd like to make this work without resorting to the hacky pattern that I usually use here, which is to change A.initializer to read:

    def initializer(self):
        print("initializer", id(self))
        A.prop = 2

Is there a saner way to achieve the same result, or a preferable pattern?

A related question: when doing the same thing in a more functional style, I'll usually assign attributes to the function I'm using, like:

from multiprocessing import get_context

def initializer():
    job.prop = 2

def job(i):
    print(job.prop, i)

ctx = get_context("fork")
with ctx.Pool(1, initializer) as p:
    for _ in p.map(job, range(5)):
        pass

I'm not super happy with this pattern either, but I see it everywhere. I'm looking for more reasonable ways to achieve either or both of these results. I'd prefer to avoid global variables, especially in the object-oriented example.


Solution

  • Python makes a new copy of a each time you send an instance method over pickle (which is how the target and args are delivered to a worker process), so the init function as well as each chunk of the map gets a new copy of a. For this reason it is frequently unhelpful to use instance methods (because the instance isn't shared anyway); a staticmethod leaves no ambiguity about the fact that you're not going to get any particular instance.
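
    A minimal demonstration of that copying, using plain pickle and no Pool at all (the class here just mirrors the one from the question; init_copy and job_copy are made-up names):

    import pickle

    class A:
        def __init__(self):
            self.prop = 1
        def initializer(self):
            self.prop = 2
        def job(self, i):
            return self.prop

    a = A()
    # each round-trip rebuilds the bound method around a brand-new copy
    # of the instance, just like sending it to a worker process does
    init_copy = pickle.loads(pickle.dumps(a.initializer))
    job_copy = pickle.loads(pickle.dumps(a.job))

    init_copy()  # mutates init_copy's own A, nothing else
    print(init_copy.__self__ is a)                  # False
    print(init_copy.__self__ is job_copy.__self__)  # False
    print(a.prop, job_copy(0))                      # 1 1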

    It is common to use the init function of a Pool to set up global constants or some other resource that will be shared with each "task" the pool operates on, but you must make sure you're actually modifying, then accessing, the same object. If you create a mutable object that exists before fork, then access it as a global, you can achieve the same (or similar) result to what you want; a functional-style sketch of that follows below. In this instance I would probably choose to modify a class attribute as my "global". That said, you still have the issue of the callable sent to map doing the same thing (giving you a new instance for each call).
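
    Here is that sketch (pool_initializer and pool_job are made-up names): a module-level dict is mutated once per worker by the initializer, and every job running in that worker reads the same dict:

    from multiprocessing import get_context

    config = {}  # module-level mutable object; each worker gets its own copy

    def pool_initializer():
        config["prop"] = 2  # mutate this worker's copy, once per worker

    def pool_job(i):
        print("job", i, config["prop"])

    if __name__ == "__main__":
        # works with "spawn" too: the initializer runs inside each worker
        ctx = get_context("fork")
        with ctx.Pool(1, pool_initializer) as p:
            for _ in p.map(pool_job, range(5)):
                pass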

    No matter how I spin it, I almost always come around to the solution of creating small (separate) functions to pass to init and map: one creates a global instance, and the other accesses that global and calls its methods:

    from multiprocessing import get_context

    class A:
        def __init__(self):
            self.prop = 1
        def initializer(self):
            print("initializer", id(self))
            self.prop = 2
        def job(self, i):
            print("job", id(self), i, self.prop)
    
    def pool_init():
        global a
        a = A()
        a.initializer()
    
    def pool_map_func(*args):
        global a
        a.job(*args)
        
    if __name__ == "__main__":
        # a = A()  # no point instantiating A in the main process; the children will have separate instances
    
        ctx = get_context("spawn")
        with ctx.Pool(1, pool_init) as p:
            for _ in p.map(pool_map_func, range(5)):
                pass
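
    If the worker's instance needs configuration decided in the parent, one variation on the same pattern (sketched here; pool_init_with_args is a made-up name, and it reuses A and pool_map_func from above) is to pass the configuration through the Pool's initargs:

    def pool_init_with_args(prop):
        # build this worker's single instance once, configured via initargs
        global a
        a = A()
        a.prop = prop

    if __name__ == "__main__":
        ctx = get_context("spawn")
        with ctx.Pool(1, pool_init_with_args, initargs=(2,)) as p:
            for _ in p.map(pool_map_func, range(5)):
                pass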
    

    I also want to point out that id is not comparable between processes: even if the id is the same, two objects living in different processes are not the same object. Depending on the implementation, specifically when using "fork", a matching id may indicate that the object has not been written to yet (copy-on-write). But relying on COW to leave most of your objects untouched isn't great in Python anyway, since reference counting and the garbage collector write to objects merely by inspecting them (unless you want to get really into the weeds).
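
    A tiny demonstration of the id point (report_id is a made-up name): ship a picklable object to a worker and compare ids; whatever the numbers come back as, they belong to two separate address spaces:

    from multiprocessing import get_context

    def report_id(x):
        return id(x)  # an address, only meaningful inside this worker

    if __name__ == "__main__":
        obj = [1, 2, 3]
        ctx = get_context("fork")
        with ctx.Pool(1) as p:
            child_id = p.apply(report_id, (obj,))
        # even if the two numbers happened to match, they would name two
        # distinct objects in two distinct processes
        print(id(obj), child_id)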