Tags: python, numpy, multiprocessing

Calling a Python multiprocessing script from another script


I have a script multiprocess.py containing something like this:

import os
import numpy as np
from multiprocessing import Pool

abc = np.load('somedata.npy')

def func(k):

    fret1 = k + abc*k
    fret2 = k + 5*k**2  # '**' is exponentiation; '^' would be bitwise XOR
    fret3 = k - abc
    
    return fret1, fret2, fret3

def worker(k):
    try:
        result = func(k)
        return result
    except Exception as e:
        print(f"An error occurred for k={k}: {e}")
        return None

def run_mp():
    xyz = np.random.random(100)
    iterator = range(len(xyz))

    # Create a list to store the results
    results = []
    ret1    = []
    ret2    = []
    ret3    = []

    # Use multiprocessing.Pool to run worker function in parallel
    with Pool(os.cpu_count()-1) as pool:
        results = pool.map(worker, iterator) 
    
    # Filter out None values (occurred due to errors)
    results = [result for result in results if result is not None]
    # Unpack the results and collect ret1, ret2, and ret3 values
    for fret1, fret2, fret3 in results:
        ret1.append(fret1)
        ret2.append(fret2)
        ret3.append(fret3)

    print("Multiprocessing complete")
    np.savez('data.npz', xyz=xyz, ret1=ret1, ret2=ret2, ret3=ret3)

if __name__ == "__main__":
    run_mp()

It runs just fine standalone, but I want to call it from another Python script like so:

with open("multiprocess.py") as f:
    exec(f.read())

Unfortunately this doesn't seem to work and I get an infinite loop.

I also tried to open the script with this command:

if __name__ == "__main__":
    with open("multiprocess.py") as f:
        exec(f.read())

And also this:

from multiprocess import run_mp

run_mp()

PS: I am working in Spyder on Windows, if that somehow matters.


Solution

  • Your problem has two to three independent causes:

    1. When not using exec, you fail to use the import guard on the invocation of run_mp
    2. Using exec to define things will screw up multiprocessing in various ways
    3. (Depends on IDE) IDEs are typically death to multiprocessing (they don't invoke the script the normal way, and multiprocessing's required invariants aren't established correctly)

    Problem cause #1 is the simplest: invoking run_mp outside an import-guarded section means that, under any start method but 'fork' (which is not available on Windows), it is invoked recursively in each worker. The main process spawns a pool of workers to do the work, each of which in turn tries to spawn its own pool of workers to do the same work, and none of them ever gets any actual work done.
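A minimal sketch of that recursion (square is an illustrative function, not from the question): under 'spawn', every worker re-imports the main module, so any unguarded pool creation runs again in each child.

```python
from multiprocessing import Pool

def square(x):
    return x * x

# BROKEN under 'spawn': this top-level call would re-execute in every
# worker, so each worker tries to spawn its own pool, recursively.
# results = Pool(2).map(square, range(4))

# CORRECT: in spawned workers __name__ is '__mp_main__', so the guarded
# block runs only in the parent process.
if __name__ == "__main__":
    with Pool(2) as pool:
        print(pool.map(square, range(4)))  # [0, 1, 4, 9]
```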

    On problem cause #2, specifically, when dispatching tasks to workers, multiprocessing uses the pickle module to serialize the functions and their arguments. Functions are pickled by their qualified name, and pickle confirms that that qualified name is legal by looking it up directly. This has three implications:

    1. You can't pass lambdas to Pool.map, even if you've assigned them to a name, e.g. foo = lambda x, y: x + y (because the name they report is modulename.<lambda>, not modulename.foo)
    2. exec will break things when not invoked at global scope (because the module in question was never imported, and because the globals it checks for things like __name__ report the module the code is being exec-ed in, not the module it came from).
    3. Even when exec is executed at global scope (which would mean pickle could find the defined names), if the exec occurs in a if __name__ == '__main__': guarded block, it won't be executed in the child process, so the unpickling in the worker process will fail, even though the pickling in the parent process succeeded.
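The qualified-name lookup can be demonstrated with pickle directly (a hedged sketch; top_level and foo are illustrative names, not from the question):

```python
import pickle

def top_level(x):
    return x + 1

foo = lambda x: x + 1  # its qualified name is '<lambda>', not 'foo'

# A module-level def pickles by qualified name and round-trips to the
# very same function object:
assert pickle.loads(pickle.dumps(top_level)) is top_level

# The named lambda does not: pickle looks up '<lambda>' and fails.
try:
    pickle.dumps(foo)
except (pickle.PicklingError, AttributeError) as exc:
    print("cannot pickle lambda:", exc)
```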

    On problem cause #3, multiprocessing is frequently broken simply by running inside an IDE, because many IDEs do something quite similar to exec-ing your script from an IDE wrapper that is the actual __name__ == "__main__" script. Even if the test still claims __name__ == "__main__" inside your script, when pickle checks sys.modules['__main__'] for your function it finds the skeleton module the IDE actually ran, not your "real" script, and things break. (The same thing happens when running a multiprocessing script under profiling: python -m cProfile myscript.py makes cProfile the sys.modules['__main__'] entry, not myscript.)
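One way to see this for yourself (a diagnostic sketch, not a fix): inspect sys.modules['__main__'] and compare it with the file you think you are running.

```python
import sys

main_mod = sys.modules["__main__"]

# From a plain `python myscript.py` run, this prints myscript.py's path.
# Under Spyder or `python -m cProfile myscript.py`, it points at the
# wrapper module instead, which is why pickle can't find your functions.
print(getattr(main_mod, "__file__", "<no file: interactive session>"))
print(main_mod.__name__)  # '__main__' either way, which is the trap
```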

    To fix these issues:

    1. Don't run multiprocessing scripts in an IDE (nor from an interactive terminal), period. Run them from the command line in an independent terminal (e.g. on Windows, cmd.exe or PowerShell, ideally via the Windows launcher if the installer was run as admin with its checkbox enabled, e.g. py -3 path/to/script.py)
    2. Don't use exec. This is a good idea in general: it's a hacky way to do what importing would do for you more correctly, more concisely, and often more efficiently (a cached import is loaded once per process, not re-loaded on every exec). There are use cases for exec, but they're very narrow; it should never be the first, second, third, or even fourth tool you reach for.
    3. Use your import guard consistently; when you used import, you didn't guard the invocation of run_mp, so launching the workers would try to run run_mp in each worker. Guarding is a critical part of the multiprocessing programming guidelines when running on Windows (where 'spawn' is the default and only available start method), and it's a good idea to adhere to it elsewhere too: macOS defaults to 'spawn' in recent versions of Python ('fork' is available, but dangerous due to non-fork-safe system libraries), and even on Linux/BSD some people choose the 'spawn' or 'forkserver' start methods when the parent process could be huge, because the default 'fork' can consume a lot of memory, possibly crashing the process(es) or triggering the OOM killer through excessive virtual memory usage.
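If you want the script to behave the same everywhere, and to catch a missing guard even while developing on Linux, you can request 'spawn' explicitly via a context. A hedged sketch, with work as an illustrative function:

```python
import multiprocessing as mp

def work(k):
    return k * k

if __name__ == "__main__":
    # Force the Windows-default start method on every platform, so an
    # unguarded invocation fails fast instead of only failing on Windows.
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(work, range(5)))  # [0, 1, 4, 9, 16]
```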

    In short, your final example usage was almost correct (if invoked from a separate script, not an interactive session, not via an IDE). You just need to change it to:

    from multiprocess import run_mp
    
    if __name__ == "__main__":
        run_mp()