
Statements before multiprocessing main() executed multiple times (Python)


I am learning Python and its multiprocessing.

I created a project with a main() in main.py and a function a_simulation inside the module simulation.py under the package simulator/.

The symptom is that a test statement print("hello\n") in main.py, placed before the definition of main(), is executed multiple times when the program is run with python main.py. This suggests that everything before the print, including the creation of the lists, is also executed multiple times.

I do not think I understand the related Python issues very well. What is the reason for this symptom, and what is the best practice in Python when creating projects like this? I have included the code and the terminal output. Thank you!

Edit: I forgot to mention that I am running this with Anaconda Python on macOS, although I would like my project to work on any platform.

main.py:

from multiprocessing import Pool
from simulator.simulation import a_simulation
import random

num_trials = 10

iter_trials = list(range(num_trials))
arg_list = [random.random() for _ in range(num_trials)]

input = list(zip(iter_trials, arg_list))

print("hello\n")

def main():
    with Pool(processes=4) as pool:
        result = pool.starmap(a_simulation, input)
        print(result)


if __name__ == "__main__":
    main()

simulator/simulation.py:

import os
from time import sleep

def a_simulation(x, seed_):

    print(f"Process {os.getpid()}: trial {x} received {seed_}\n" )
    sleep(1)

    return seed_

Results from the terminal:

hello

hello

hello

hello

hello

Process 71539: trial 0 received 0.4512600158461971

Process 71538: trial 1 received 0.8772526554425158

Process 71541: trial 2 received 0.6893833978242683

Process 71540: trial 3 received 0.29249994820563296

Process 71538: trial 4 received 0.5759647958461107

Process 71541: trial 5 received 0.08799525261308505

Process 71539: trial 6 received 0.3057644321667139

Process 71540: trial 7 received 0.5402091856171599

Process 71538: trial 8 received 0.1373456223147438

Process 71541: trial 9 received 0.24000943476017

[0.4512600158461971, 0.8772526554425158, 0.6893833978242683, 0.29249994820563296, 0.5759647958461107, 0.08799525261308505, 0.3057644321667139, 0.5402091856171599, 0.1373456223147438, 0.24000943476017]

Solution

  • This happens because multiprocessing uses the spawn start method by default on Windows and macOS (on macOS since Python 3.8). With spawn, each child process starts as a fresh Python interpreter that shares none of the parent's memory. That makes it awkward to run one of the parent's functions in the child: the child does not know the function's definition, and the function may depend on variables defined in the parent's module. To get around this, multiprocessing has each child import the parent's main module again, which re-executes all of its top-level code and so rebuilds most of the parent's module-level state. That is why print("hello\n") and the list creation run five times: once in the parent and once in each of the four pool workers.
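The re-import can be imitated without starting any processes. The sketch below is an illustration, not how you would normally write a program: it writes a tiny stand-in script (the file name demo_main.py and the helper run_as are made up for this example) and executes it twice with runpy.run_path, once under the name __main__ and once under __mp_main__. The latter is the name CPython's spawn implementation gives the re-imported script in a child; treat that name as an implementation detail.

```python
import contextlib
import io
import os
import runpy
import tempfile

# A tiny stand-in for main.py: one top-level print plus a guarded block.
SCRIPT = '''
print("hello")

if __name__ == "__main__":
    print("guarded block ran")
'''


def run_as(run_name):
    """Execute SCRIPT under the given module name and return the printed lines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_main.py")
        with open(path, "w") as f:
            f.write(SCRIPT)
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            runpy.run_path(path, run_name=run_name)
        return buffer.getvalue().splitlines()


# What `python main.py` does in the parent process.
parent_output = run_as("__main__")
# Roughly what a spawned child does when it re-imports the parent's script.
child_output = run_as("__mp_main__")

print("parent:", parent_output)
print("child: ", child_output)
```

Both runs print hello, but only the first one reaches the guarded block, which matches the repeated hello lines in the question's terminal output.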

    This is where if __name__ == "__main__" comes in. The condition roughly means "this file is being run directly"; the code under it does not run when the module is imported. A spawned child therefore skips everything in this block. You can use it to create, for example, memory-heavy variables that the parent needs but the child processes do not. In short, anything the child processes do not need can go here.
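Applied to the question's main.py, one possible restructuring looks like the sketch below. It is only a sketch under some assumptions: a_simulation is inlined here (without the sleep) so the example is self-contained, whereas in the real project it would still be imported from simulator.simulation, and build_inputs is a hypothetical helper name introduced for illustration.

```python
import os
import random
from multiprocessing import Pool


def a_simulation(x, seed_):
    # Stand-in for simulator.simulation.a_simulation, inlined so the sketch
    # is self-contained.
    print(f"Process {os.getpid()}: trial {x} received {seed_}")
    return seed_


def build_inputs(num_trials):
    """Create the (trial index, argument) pairs; only the parent calls this."""
    arg_list = [random.random() for _ in range(num_trials)]
    return list(zip(range(num_trials), arg_list))


def main():
    print("hello")  # executed by the parent only, because main() is guarded
    inputs = build_inputs(10)
    with Pool(processes=4) as pool:
        result = pool.starmap(a_simulation, inputs)
    print(result)


if __name__ == "__main__":
    main()
```

The top level now contains only imports and definitions, so a child that re-imports the file defines the functions but neither prints hello nor rebuilds the lists. The arguments still reach the workers, because starmap sends them from the parent.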

    Now coming to your comment about imports:

    This must be a silly question, but should I leave the import statements as they are, or move them inside if __name__ == "__main__":, or somewhere else? Thanks

    As I said, anything the child does not need can go under this if block. You rarely see imports there, probably because of convention ("imports go at the top") and because re-importing a module in each child usually costs little. Keep in mind, however, that if a child process needs a particular module to do its work, that module is imported again inside the child even if the parent imported it under the if __name__... block. When you ask a pool to run a function in parallel, multiprocessing pickles a reference to it, namely the function's name and the name of the module that defines it (not the code itself), and sends that to the child processes, which import the module to look the function up (relevant question).

    This applies when the start method is spawn; with fork, the child inherits the parent's memory instead of re-importing the module. You can read more about the differences here
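The point about functions being sent by name can be checked directly with pickle, which multiprocessing uses to pass work to a pool. In this small illustration json.dumps is only a stand-in for a worker function such as a_simulation; any function defined at the top level of an importable module should behave the same way.

```python
import json
import pickle
import pickletools

# json.dumps stands in for a worker function such as a_simulation.
payload = pickle.dumps(json.dumps)

# The payload holds the module name and the function name, not the code.
print(len(payload), "bytes")
pickletools.dis(payload)

# Unpickling imports the module and looks the function up by name.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

The disassembly lists the strings json and dumps and nothing resembling the function body, and unpickling returns the very same function object. That is why the module defining the worker function has to be importable in every child.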