Tags: python, multiprocessing, python-multiprocessing

What is the lifecycle of a process in python multiprocessing?


In normal Python code, I can understand the lifecycle of the process, e.g. when executing python script.py:

  1. The shell receives the command python script.py, and the OS creates a new process that starts executing the python executable.
  2. The python executable sets up the interpreter and starts executing script.py.
  3. When script.py finishes execution, the python interpreter exits.

In the case of multiprocessing, I'm curious about what happens to the other processes.

Take the following code for example:

# test.py
import multiprocessing

# Function to compute square of a number
def compute_square(number):
    print(f'The square of {number} is {number * number}')

if __name__ == '__main__':
    # List of numbers
    numbers = [1, 2, 3, 4, 5]
    
    # Create a list to keep all processes
    processes = []
    
    # Create a process for each number to compute its square
    for number in numbers:
        process = multiprocessing.Process(target=compute_square, args=(number,))
        processes.append(process)
        process.start()
    
    # Ensure all processes have finished execution
    for process in processes:
        process.join()

    print("All processes have finished execution.")

When I execute python test.py, I understand that test.py will be executed as the __main__ module. But what happens in the other processes?

To be specific, when I execute multiprocessing.Process(target=compute_square, args=(number,)).start(), what happens to that process?

How does that process invoke the python interpreter? If it is just python script.py, how does it know that it needs to execute a function named compute_square? Or does it use python -i and pass the command to execute through a pipe?


Solution

  • As per the Python documentation for the multiprocessing module, the underlying system facility used to create child processes depends on the platform, with three different "start methods": spawn, fork, and forkserver.

    Which one is used by default depends on the platform, although you can choose the start method yourself with multiprocessing.set_start_method(), taking the method name as a string.

    When you use fork() on a POSIX system, the child process is pretty much a clone of its parent – aside from its PID, of course, and a few other necessary differences. It resumes running the same code from the same point, and while its memory pages are initially shared with the parent, each process gets a dedicated copy of any page it writes to (copy-on-write). This is the easiest model to understand, in my opinion: just think of the new process as a complete copy of the original, except that it knows it is the child (as opposed to the parent) and therefore runs the requested function.

    If you're unfamiliar with fork() I would encourage you to read up on it, maybe starting with the Wikipedia article and then the man page.
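The fork model is easy to observe directly with os.fork() – a minimal sketch of my own, POSIX only:

```python
import os

# POSIX only: fork() returns twice – 0 in the child,
# and the child's pid in the parent.
pid = os.fork()
if pid == 0:
    # child: an almost-exact copy of the parent process
    print(f'child:  pid={os.getpid()}, ppid={os.getppid()}')
    os._exit(0)  # exit immediately, skipping parent-side cleanup
else:
    # parent: wait for (reap) the child, then report
    os.waitpid(pid, 0)
    print(f'parent: pid={os.getpid()}, forked child pid={pid}')
```

Both branches are the same program; the only thing telling the two processes apart is fork()'s return value – which is essentially how multiprocessing's fork start method knows to run the target function in the child.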

    spawn also has a Wikipedia article and a man page.

    As for Python, we can test all 3 start methods with a simple program that takes spawn, fork, or forkserver as its first argument:

    from multiprocessing import Process, set_start_method
    import os
    import random
    import sys
    
    # generate a random value in the parent process
    r = random.randint(0, int(1e9))
    
    def info(msg):
        print(f'pid: {os.getpid()}, ppid: {os.getppid()}, module: {__name__}, msg: {msg}')
    
    def f(arg):
        info(f'[child] arg: {arg}, r: {r}')
    
    if __name__ == '__main__':
        if len(sys.argv) != 2:
            print(f'Usage: {sys.argv[0]} spawn|fork|forkserver')
            sys.exit(1)
    
        method = sys.argv[1]
        print(f'setting start method: {method}')
        set_start_method(method)
    
        info(f'[parent] r: {r}')
        p = Process(target=f, args=('hello',))
        p.start()
        p.join()
    

    When spawn is used, a fresh interpreter process is started, the script is imported into it as __mp_main__, and the target function is then called. See how the random value r is not preserved: the module-level call to random.randint() runs again during that import.

    setting start method: spawn
    pid: 44658, ppid: 1356, module: __main__, msg: [parent] r: 242489315
    pid: 44688, ppid: 44658, module: __mp_main__, msg: [child] arg: hello, r: 229487814
    
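As for how the spawn child is launched (the OP's last question): CPython starts a new interpreter with a python -c command that calls multiprocessing's bootstrap function, and then sends the pickled Process object to the child over a pipe or inherited file descriptor – so it is neither python script.py nor python -i. The command line can be inspected with multiprocessing.spawn.get_command_line(), an internal, undocumented helper that may change between versions; the pipe_handle value below is just a placeholder:

```python
import multiprocessing.spawn as spawn

# Internal CPython helper (an implementation detail, not public API):
# it returns the argv used to launch a 'spawn' child. The child runs
# spawn_main(), which unpickles the Process object from the given
# pipe/fd and then calls its run() method.
cmd = spawn.get_command_line(pipe_handle=7)  # 7 is a placeholder fd
print(cmd)
```

On my machine this prints something along the lines of ['/usr/bin/python3', '-c', 'from multiprocessing.spawn import spawn_main; spawn_main(pipe_handle=7)', '--multiprocessing-fork'].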

    With fork, we are cloning the process right at the point where we call p.start(), so we still see the same value for r:

    setting start method: fork
    pid: 49721, ppid: 1356, module: __main__, msg: [parent] r: 376097656
    pid: 49738, ppid: 49721, module: __main__, msg: [child] arg: hello, r: 376097656
    
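The copy-on-write behaviour can be seen directly. In this sketch of my own (forcing the fork context regardless of the platform default), the child mutates a module-level dict, yet the parent's copy is untouched:

```python
import multiprocessing as mp
import os

data = {'owner': 'parent'}

def mutate(q):
    # This write gives the child its own private copy of the
    # memory holding `data` (copy-on-write); the parent's dict
    # is unaffected.
    data['owner'] = 'child'
    q.put(data['owner'])

if __name__ == '__main__':
    ctx = mp.get_context('fork')  # force fork explicitly
    q = ctx.Queue()
    p = ctx.Process(target=mutate, args=(q,))
    p.start()
    print('child sees: ', q.get())        # child
    p.join()
    print('parent sees:', data['owner'])  # still 'parent'
```

The child starts with the parent's state (it can see data at all, unlike under spawn), but once either process writes, the two diverge – forked processes share history, not future changes.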

    And with forkserver, I would have expected multiple processes to inherit the same value for r, but this doesn't seem to be the case:

    setting start method: forkserver
    pid: 66317, ppid: 1356, module: __main__, msg: [parent] r: 917735863
    pid: 66336, ppid: 66334, module: __mp_main__, msg: [child] arg: hello, r: 698876823
    

    Starting and stopping a second process just after the first one doesn't give it the same value for r as either the root process or the first child. What is clear from the output is that the child is not forked from the root process at all: its ppid (66334) belongs to a dedicated server process that multiprocessing starts separately. That server imports the script on its own as __mp_main__ (note the module name in the child's output), so the module-level call to random.randint() runs again outside the parent – which would explain why the parent's value is not inherited.
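To check where forkserver children actually come from, here is a small experiment of my own (assuming a POSIX system where forkserver is available): both children report the same parent pid, and it is not the main process:

```python
import multiprocessing as mp
import os

def report(q):
    # send back this child's pid and its parent's pid
    q.put((os.getpid(), os.getppid()))

if __name__ == '__main__':
    ctx = mp.get_context('forkserver')
    q = ctx.Queue()
    procs = [ctx.Process(target=report, args=(q,)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    (pid1, ppid1), (pid2, ppid2) = q.get(), q.get()
    print(f'main pid: {os.getpid()}')
    print(f'children {pid1} and {pid2}, both forked from {ppid1}')
    # both children share the same parent: the fork server,
    # which is a separate process from __main__
    assert ppid1 == ppid2 != os.getpid()
```

So the fork server sits between the main process and the workers: the main process asks it (over a socket) to fork, and every child is a clone of the server's state rather than of the main process at the moment of p.start().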