Tags: python, multiprocessing, python-multiprocessing

What objects and variables are copied to child processes (by pickling) when I use Python 3 multiprocessing.pool?


I'm struggling to find answers on what objects and variables are copied to child processes when creating a multiprocessing pool in Python 3.

In other words, say I have a huge list (~230,000,000 elements) stored in a class, and a method of that class uses a pool of four child processes. Will this list be copied to all four child processes if...

  1. the child processes do not read from the list?
  2. the child processes read from the list (however, the list is not modified)?

Solution

  • Note: this answer is partial. I too have not (yet) found documentation that settles this, but the following gives some empirical evidence.


    The following code demonstrates how data is passed or copied to child processes when using a Pool (the list l is deliberately not used in the map call, to keep the output clean):

    from multiprocessing import Pool
    import os
    
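    def process(x):
        # Print this worker's PID, the module name it runs under, and whether
        # a global named l exists in this process.
        print(os.getpid(), __name__, 'l' in globals())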
    
    # A - l = list(range(100000))
    if __name__ == "__main__":
        # B - l = list(range(100000))
        with Pool() as pool:
            pool.map(process, [1,2,3,4])
    
        print(os.getpid(), __name__, 'l' in globals())
    

    On Windows

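    With line A uncommented, the output looks similar to:

    19604 __mp_main__ True
    6392 __mp_main__ True
    19604 __mp_main__ True
    7048 __mp_main__ True
    6568 __main__ True
    

    Here the list is defined outside the __name__ guard. On Windows, each child process starts a fresh interpreter and re-imports the .py file (under the name __mp_main__, as the output shows), so every process runs the module-level code and builds its own copy of l. The list is not sent from the parent; each child recreates it. (PID 19604 appears twice because one worker handled two of the four tasks.)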

    With line B uncommented, the output looks similar to:

    7248 __mp_main__ False
    22644 __mp_main__ False
    22676 __mp_main__ False
    16520 __mp_main__ False
    19736 __main__ True
    

    Here the list is defined inside the __name__ guard, so only the __main__ process has it. The children re-import the file as __mp_main__, skip the guarded block, and never define l. The only data they receive from the parent is what map passes to them as arguments (here the items 1 to 4).
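
    The hop that an argument makes on its way to a worker can be imitated without starting any processes. The following is a minimal sketch, not Pool's actual internals: round_trip is a hypothetical helper that only mimics the serialize-then-rebuild step, assuming the argument travels as a pickle.

```python
import pickle


def round_trip(obj):
    """Serialize obj and rebuild it, as happens to an argument sent to a worker."""
    payload = pickle.dumps(obj)  # the bytes that would be sent to the child
    return pickle.loads(payload), len(payload)


original = list(range(1000))
rebuilt, payload_size = round_trip(original)

# The rebuilt list has equal contents but is a separate object, so a change
# made to it (as a worker might) is not visible in the original (the parent).
rebuilt.append(-1)
print(len(original), len(rebuilt), payload_size)
```

    The payload size grows with the list, which is why passing a huge list as a map argument is expensive, while a list that is never passed is not serialized at all.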

    On Linux

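    Uncommenting either line gives output similar to:

    25261 __main__ True
    25262 __main__ True
    25263 __main__ True
    25264 __main__ True
    25260 __main__ True
    

    I am guessing that this is because Linux uses fork to create the child processes: each child starts as a clone of the parent at the moment the pool is created, so the list is defined either way, and the module name stays __main__. Note that fork was the default start method on Linux up to Python 3.13; from Python 3.14 the default is forkserver, in which case I would expect output like the Windows case instead.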
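
    To check which case applies on a given machine, the start method can be queried directly. This is a small sketch under stated assumptions: get_all_start_methods and get_start_method are real multiprocessing functions, while inherits_parent_globals is a hypothetical helper that just encodes the behaviour observed above (only fork clones the parent).

```python
import multiprocessing as mp


def inherits_parent_globals(method):
    """Return True if children created with this start method begin as a clone
    of the parent, so names defined under the __name__ guard exist in them.

    Assumption: only "fork" behaves this way; "spawn" and "forkserver"
    children re-import the main module instead.
    """
    return method == "fork"


available = mp.get_all_start_methods()  # e.g. fork, spawn, forkserver on Linux
default = mp.get_start_method()         # the method Pool() uses on this platform

for method in available:
    marker = "(default)" if method == default else ""
    print(method, inherits_parent_globals(method), marker)
```

    To reproduce the Windows behaviour on Linux, the pool can be created from mp.get_context("spawn") instead of calling Pool() directly.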