Tags: python, multiprocessing, python-multiprocessing

Why is executor.map from multiprocessing running the entire main again?


I have the following snippet from my main.py script:

# --------------------------------------------------------------------------
# ... a bunch of code above this point, including a write-to-disk step
# that is meant to run only once ...
# --------------------------------------------------------------------------

import concurrent.futures  # imports shown here for clarity
from multiprocessing import get_context

import util

context = get_context('spawn')
someInitData1 = context.Value('i', -1)
someInitData2 = context.Value('i', 0)

with concurrent.futures.ProcessPoolExecutor(max_workers=4,
                                            mp_context=context,
                                            initializer=util.init_func,
                                            initargs=(someInitData1, someInitData2)
                                            ) as executor:
    multiProcessResults = [x for x in executor.map(util.multi_process_job,
                                                   someArguments1,
                                                   someArguments2,
                                                   )]

I intend for only util.multi_process_job to be parallelized with multiprocessing. However, for some reason, with this snippet, all of the code in my main.py gets rerun from the beginning, in parallel, as a new process by each worker.

What is strange to me is that this snippet works fine for my needs when I run it in a Jupyter notebook: only the specified function runs. The problem only occurs when I convert the .ipynb file to a .py file and run it as a regular Python script on a Linux machine.


Solution

  • The problem is here:

    context = get_context('spawn')
    

    ...wherein you're forcing the start method that is compatible with Windows, but which creates each new process as a completely fresh Python interpreter. Consequently, those new processes have to import your module separately, so they rerun any code that executes at import time; in your case, that is everything at the top level of main.py, including the write-to-disk step.
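
    As a quick illustration (a minimal, self-contained sketch; the square function and the printed message are invented for this example, not taken from your code), any top-level statement runs once in the parent and again in every spawned worker:

    import concurrent.futures
    from multiprocessing import get_context
    import os

    # Top level: re-executed by each spawned worker when it re-imports
    # this module, in addition to running once in the parent process.
    print(f"module imported in pid {os.getpid()}")

    def square(x):
        return x * x

    if __name__ == '__main__':
        ctx = get_context('spawn')
        with concurrent.futures.ProcessPoolExecutor(max_workers=2, mp_context=ctx) as ex:
            print(list(ex.map(square, range(4))))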

    To avoid this, use get_context('fork'), available on Unix-like systems (including your Linux machine) but not on Windows, to make each new process a copy of your existing Python process, with all the prior state (like the modules already loaded and cached in memory) available.
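
    Applied to your snippet, that is a one-line change; everything else stays the same:

    # 'fork' clones the current process instead of starting a fresh
    # interpreter, so main.py is not re-imported and the write-to-disk
    # code above the pool does not run again in the workers.
    context = get_context('fork')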


    Alternatively, you can put all your top-level code inside an if __name__ == '__main__': guard, so it only runs when your script is executed directly, but not when it's imported.
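
    For example, main.py could be restructured like this (a sketch based on your snippet; someArguments1, someArguments2, and the util functions are placeholders carried over from your code, so fill in your real definitions):

    import concurrent.futures
    from multiprocessing import get_context

    import util

    def main():
        # Code that must run only once (e.g. the write-to-disk step)
        # belongs here rather than at module top level.
        context = get_context('spawn')
        someInitData1 = context.Value('i', -1)
        someInitData2 = context.Value('i', 0)

        with concurrent.futures.ProcessPoolExecutor(max_workers=4,
                                                    mp_context=context,
                                                    initializer=util.init_func,
                                                    initargs=(someInitData1, someInitData2)
                                                    ) as executor:
            return list(executor.map(util.multi_process_job,
                                     someArguments1,
                                     someArguments2))

    if __name__ == '__main__':
        # Runs only when main.py is executed directly, not when a
        # spawned worker re-imports the module.
        multiProcessResults = main()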