I have the following snippet from a main.py script:
------------------------------------------------------------------------------------------------
a bunch of code above that has a write-to-disk step that is meant to run only once
------------------------------------------------------------------------------------------------
import concurrent.futures
from multiprocessing import get_context

import util

context = get_context('spawn')
someInitData1 = context.Value('i', -1)
someInitData2 = context.Value('i', 0)

with concurrent.futures.ProcessPoolExecutor(max_workers=4,
                                            mp_context=context,
                                            initializer=util.init_func,
                                            initargs=(someInitData1, someInitData2)
                                            ) as executor:
    multiProcessResults = [x for x in executor.map(util.multi_process_job,
                                                   someArguments1,
                                                   someArguments2,
                                                   )]
I intend for only util.multi_process_job to be parallelized with multiprocessing. However, for some reason, with this snippet all of the code in my main.py gets rerun from the beginning, in parallel, as a new process for each worker.

What is strange to me is that this snippet works fine for my needs when I run it in a Jupyter notebook: only the specified function runs in the workers. The problem only occurs when I convert the .ipynb file to a .py file and run it as a regular Python script on a Linux machine.
The problem is here:
context = get_context('spawn')
...wherein you're forcing the 'spawn' start method: a mode that is compatible with Windows, but which creates each new process as a completely fresh copy of the Python interpreter. Consequently, those new processes have to import your module separately, and so they rerun any code that executes at import time, i.e. everything at the top level of your script.
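You can see this with a small self-contained script (the names here are illustrative, not from your code; the __main__ guard just keeps the demo from recursing, more on it below). Under 'spawn', the top-level print is executed again in every worker process:

import concurrent.futures
from multiprocessing import get_context

# This line is at module level, so every spawned worker re-runs it
# when it imports the script.
print('module-level code running')

def job(x):
    return x * 2

if __name__ == '__main__':
    ctx = get_context('spawn')
    with concurrent.futures.ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        print(list(executor.map(job, range(4))))

Running this prints 'module-level code running' once in the parent and again in each spawned worker, which is exactly the kind of repetition you're seeing with your write-to-disk code.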
To avoid this, use get_context('fork') to make each new process a copy of your existing Python process, with all of its prior state (such as the modules already loaded and cached in memory) available.
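As a minimal sketch of that change (square here is just a stand-in for util.multi_process_job; the rest of your snippet can stay as it is):

import concurrent.futures
from multiprocessing import get_context

def square(x):
    # Stand-in for util.multi_process_job.
    return x * x

# 'fork' clones the current interpreter, so already-imported modules and other
# top-level state are inherited by the workers instead of being re-executed.
# Note that 'fork' is only available on Unix-like systems such as your Linux machine.
context = get_context('fork')

with concurrent.futures.ProcessPoolExecutor(max_workers=4, mp_context=context) as executor:
    results = list(executor.map(square, range(10)))
print(results)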
Alternatively, you can put all of your top-level code inside an if __name__ == '__main__': guard, so that it runs only when your script is executed directly, not when it is imported.
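And a self-contained sketch of that layout (init_worker, process_item, and the sample argument lists are hypothetical stand-ins for util.init_func, util.multi_process_job, and your real data):

import concurrent.futures
from multiprocessing import get_context

def init_worker(shared_a, shared_b):
    # Stand-in for util.init_func: remember the shared values in each worker.
    global workerA, workerB
    workerA, workerB = shared_a, shared_b

def process_item(x, y):
    # Stand-in for util.multi_process_job.
    return x + y

def main():
    # The write-to-disk step and any other run-once code belongs in here,
    # so it only ever executes in the parent process.
    context = get_context('spawn')
    someInitData1 = context.Value('i', -1)
    someInitData2 = context.Value('i', 0)
    someArguments1 = [1, 2, 3, 4]      # stand-in input data
    someArguments2 = [10, 20, 30, 40]  # stand-in input data

    with concurrent.futures.ProcessPoolExecutor(max_workers=4,
                                                mp_context=context,
                                                initializer=init_worker,
                                                initargs=(someInitData1, someInitData2)
                                                ) as executor:
        return list(executor.map(process_item, someArguments1, someArguments2))

if __name__ == '__main__':
    # Runs only when the script is executed directly, never when a spawned
    # worker re-imports this file.
    print(main())

The guard also lets you keep the 'spawn' start method, which you will need anyway if the script ever has to run on Windows, where 'fork' is not available.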