Tags: python, windows, concurrency, multiprocessing, python-multiprocessing

Python multiprocessing CPU-bound concurrency without cloning main


I am currently writing a CPU-bound script using Python multiprocessing. The main module has a lot of import statements and the like that create overhead when new processes start (on Windows, multiprocessing spawns each child by re-importing the main module). These imports are not necessary to run the parallel portion of the program, so I would like to avoid them. This could be fixed by placing all of my import statements under if __name__ == "__main__":, but this is a large codebase and the parallel-processed module may be used by many developers of varying experience levels. (I don't want to fix everything, and I don't want to let other people break it.)
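To be concrete, the guard pattern I mean looks like the following minimal sketch (the worker body and the guarded import are placeholders I made up):

import multiprocessing  # lightweight import, needed by parent and children

def worker(i):
    # The parallel work; it relies only on lightweight modules.
    print("worker", i, "running")

if __name__ == "__main__":
    # Heavy imports live here. On Windows the spawned child re-imports
    # this file with __name__ set to "__mp_main__", so this block is
    # skipped in the children.
    import xml.dom.minidom  # stand-in for the expensive imports

    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker, args=(i,))
        jobs.append(p)
        p.start()
    for p in jobs:
        p.join()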

I would like to import only the modules necessary to run the parallel portion of the code. I've found a workaround, but it strikes me as... hacky: I update sys.modules so that __main__ appears to be the module with the parallel processing, then put the real main back when I'm done. For instance:

import sys
import multiprocessing

def worker():
    pass  # stand-in for the real parallel work

try:
    # Swap this module in as __main__ so that children spawned on
    # Windows re-import it rather than the heavyweight real main.
    main = sys.modules["__main__"]
    sys.modules["__main__"] = sys.modules[__name__]

    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=worker)
        jobs.append(p)
        p.start()
finally:
    # Put the real main module back.
    sys.modules["__main__"] = main

This code runs and imports only the desired module, but I am concerned that some horrible consequence is hidden under the surface.
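For what it's worth, the swap can at least be packaged as a context manager so the restore always happens; this is just the same trick tidied up, and masquerade_as_main is a name I made up:

import sys
from contextlib import contextmanager

@contextmanager
def masquerade_as_main(module_name):
    # Temporarily point sys.modules["__main__"] at the named module so
    # that children spawned inside the block re-import it instead of
    # the real (heavyweight) main script, then restore the original.
    real_main = sys.modules["__main__"]
    sys.modules["__main__"] = sys.modules[module_name]
    try:
        yield
    finally:
        sys.modules["__main__"] = real_main

Used as with masquerade_as_main(__name__): around the process-starting loop above.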

Ideally I would like an alternative to multiprocessing that gives me more control over what is cloned at process spawn. Does anyone have a suggestion, a less horrifying workaround, or reassurance that my workaround isn't as horrifying as I fear?

I am using Windows and Python 3.5.

Thanks!


Solution

  • My guess is that joblib will do a better job; see this very complete discussion for more.
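For reference, the core joblib API is Parallel plus delayed; a minimal sketch follows (square and n_jobs=4 are placeholders), with the caveat that how much import overhead it avoids depends on the backend joblib uses:

from joblib import Parallel, delayed

def square(i):
    # Stand-in for the real CPU-bound work.
    return i * i

if __name__ == "__main__":
    # Fan the calls out over 4 worker processes and collect the results.
    results = Parallel(n_jobs=4)(delayed(square)(i) for i in range(5))
    print(results)  # [0, 1, 4, 9, 16]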