In normal Python code, I can understand the lifecycle of the process. For example, when executing `python script.py`:

1. The OS creates a new process to start executing `python`.
2. The `python` executable sets up the interpreter and starts to execute `script.py`.
3. When `script.py` finishes execution, the Python interpreter exits.

In the case of multiprocessing, I'm curious about what happens to the other processes.
Take the following code for example:
```python
# test.py
import multiprocessing

# Function to compute square of a number
def compute_square(number):
    print(f'The square of {number} is {number * number}')

if __name__ == '__main__':
    # List of numbers
    numbers = [1, 2, 3, 4, 5]

    # Create a list to keep all processes
    processes = []

    # Create a process for each number to compute its square
    for number in numbers:
        process = multiprocessing.Process(target=compute_square, args=(number,))
        processes.append(process)
        process.start()

    # Ensure all processes have finished execution
    for process in processes:
        process.join()

    print("All processes have finished execution.")
```
When I execute `python test.py`, I understand that `test.py` will be executed as the `__main__` module. But what happens to the other processes?

To be specific, when I execute `multiprocessing.Process(target=compute_square, args=(number,)).start()`, what happens to that process? How does that process invoke the Python interpreter? If it is just `python script.py`, how does it know it needs to execute a function named `compute_square`? Or does it use `python -i` and pass the command to execute through a pipe?
As per the Python documentation for the `multiprocessing` module, the underlying system feature used to create the process depends on the platform, with 3 different "start methods": `spawn`, `fork`, and `forkserver`.

Which one is used by default depends on the platform, although you can choose the start method yourself with `multiprocessing.set_start_method()`, taking the method name as a string.
When you use `fork()` on a POSIX system, the child process is pretty much a clone of its parent – aside from its PID, of course, and other necessary differences. It runs the same code from the same point in memory, and while their memory pages are initially shared, a dedicated copy is made for each process as it writes to any shared page. This is the easiest model to understand in my opinion: just think of the new process as a complete copy of the initial process, except that it knows it is the child process (as opposed to the parent), and decides to run the requested function as a result.

If you're unfamiliar with `fork()` I would encourage you to read up on it, maybe starting with the Wikipedia article and then the man page.
`spawn` also has a Wikipedia article and a man page.
As for Python, we can test all 3 start methods with a simple program that takes `spawn`, `fork`, or `forkserver` as its first argument:
```python
from multiprocessing import Process, set_start_method
import os
import random
import sys

# generate a random value in the parent process
r = random.randint(0, int(1e9))

def info(msg):
    print(f'pid: {os.getpid()}, ppid: {os.getppid()}, module: {__name__}, msg: {msg}')

def f(arg):
    info(f'[child] arg: {arg}, r: {r}')

if __name__ == '__main__':
    if len(sys.argv) != 2:
        print(f'Usage: {sys.argv[0]} spawn|fork|forkserver')
        sys.exit(1)

    method = sys.argv[1]
    print(f'setting start method: {method}')
    set_start_method(method)

    info(f'[parent] r: {r}')

    p = Process(target=f, args=('hello',))
    p.start()
    p.join()
```
When `spawn` is used, a fresh interpreter is created and the process's entry point function is called. See how the random variable `r` is not preserved:

```
setting start method: spawn
pid: 44658, ppid: 1356, module: __main__, msg: [parent] r: 242489315
pid: 44688, ppid: 44658, module: __mp_main__, msg: [child] arg: hello, r: 229487814
```
With `fork`, we are cloning the process right at the point where we call `p.start()`, so we still see the same value for `r`:

```
setting start method: fork
pid: 49721, ppid: 1356, module: __main__, msg: [parent] r: 376097656
pid: 49738, ppid: 49721, module: __main__, msg: [child] arg: hello, r: 376097656
```
And with `forkserver`, I would have expected multiple processes to inherit the same value for `r`, but this doesn't seem to be the case:

```
setting start method: forkserver
pid: 66317, ppid: 1356, module: __main__, msg: [parent] r: 917735863
pid: 66336, ppid: 66334, module: __mp_main__, msg: [child] arg: hello, r: 698876823
```

Note that the child's `ppid` is not the root process's PID: the child was forked from the intermediate server process. Starting and stopping a second process just after the first one doesn't give it the same value for `r` as either the root process or the first child process. The child's module name is `__mp_main__` rather than `__main__`, which suggests the main module gets imported again after the original process started, re-running the module-level `random.randint()` call each time, rather than `r` being inherited from a single long-lived server.