I have a script multiprocess.py containing something like this:
import os
import numpy as np
from multiprocessing import Pool

abc = np.load('somedata.npy')

def func(k):
    fret1 = k+abc*k
    fret2 = k+5*k^2
    fret3 = k-abc
    return fret1, fret2, fret3

def worker(k):
    try:
        result = func(k)
        return result
    except Exception as e:
        print(f"An error occurred for k={k}: {e}")
        return None

def run_mp():
    xyz = np.random.random(100)
    iterator = range(len(xyz))
    # Create a list to store the results
    results = []
    ret1 = []
    ret2 = []
    ret3 = []
    # Use multiprocessing.Pool to run worker function in parallel
    with Pool(os.cpu_count()-1) as pool:
        results = pool.map(worker, iterator)
    # Filter out None values (occurred due to errors)
    results = [result for result in results if result is not None]
    # Unpack the results and collect ret1, ret2, and ret3 values
    for fret1, fret2, fret3 in results:
        ret1.append(fret1)
        ret2.append(fret2)
        ret3.append(fret3)
    print("Multiprocessing complete")
    np.savez('data.npz', xyz=xyz, ret1=ret1, ret2=ret2, ret3=ret3)

if __name__ == "__main__":
    run_mp()
Standalone it runs just fine, but I want to call it from another Python script like so:

with open("multiprocess.py") as f:
    exec(f.read())

Unfortunately this doesn't seem to work, and I get an infinite loop.
I also tried to open the script with this command:
if __name__ == "__main__":
    with open("multiprocess.py") as f:
        exec(f.read())
And also this:
from multiprocess import run_mp
run_mp()
PS: I am working in Spyder on Windows and trying to run the code there, if that somehow matters.
Your problem is caused, independently, in 2-3 different ways:

1. When you use exec, you fail to use the import guard on the invocation of run_mp.
2. Using exec to define things will screw up multiprocessing in various ways.
3. Running the script from an IDE can break multiprocessing all on its own (IDEs don't invoke the script the normal way, and multiprocessing's required invariants aren't established correctly).

Problem cause #1 is the simplest: invoking run_mp outside an import-guarded section means that, in any start method mode but 'fork' (which is not available on Windows), it gets invoked recursively in each worker. The main process spawns a pool of workers to do the work, which in turn each try to spawn a pool of workers to do the same work, and none of them ever manage to actually do any work.
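For contrast, here is a minimal sketch of a layout that avoids that recursion; the module name mp_demo.py and the functions in it are made up for illustration. Everything that creates the pool lives behind the guard, so when a worker process re-imports the module it only re-defines the functions and never re-launches the pool.

# mp_demo.py -- illustrative only
from multiprocessing import Pool

def square(x):
    # Plain module-level function, picklable by its qualified name
    return x * x

def main():
    with Pool(4) as pool:
        print(pool.map(square, range(10)))

if __name__ == "__main__":
    # Workers re-import this module under a different __name__
    # (e.g. '__mp_main__' with the 'spawn' start method), so they skip this block.
    main()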
On problem cause #2, specifically: when dispatching tasks to workers, multiprocessing uses the pickle module to serialize the functions and their arguments. Functions are pickled by their qualified name, and pickle confirms that that qualified name is legal by looking it up directly. This has three implications:

1. You can't pass lambdas to Pool.map, even if you've assigned them to a name, e.g. foo = lambda x, y: x + y (because the name they report is modulename.<lambda>, not modulename.foo).
2. Using exec to define things will break when the exec is not invoked at global scope (because the module in question is not imported, and because the globals it checks for things like __name__ report the module it is being exec-ed in, not the module the code came from).
3. Even if the exec is executed at global scope (which would mean pickle could find the defined names), if the exec occurs in an if __name__ == '__main__': guarded block, it won't be executed in the child process, so the unpickling in the worker will fail even though the pickling in the parent process succeeded.
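You can see the qualified-name behaviour directly with a short, self-contained sketch (the function names here are invented for the demonstration):

import pickle

def add(x, y):
    # Ordinary module-level function: pickled as a reference to "add"
    # in its defining module, then looked up again on unpickling.
    return x + y

foo = lambda x, y: x + y  # a lambda bound to a name

print(pickle.dumps(add))  # works; the payload records module + name, not code
try:
    pickle.dumps(foo)
except pickle.PicklingError as e:
    # Fails: the lambda's qualified name is "<lambda>", so the lookup
    # that pickle performs cannot find it under the name "foo".
    print("cannot pickle the lambda:", e)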
On problem cause #3: multiprocessing is frequently broken by running in an IDE at all, because many IDEs do something quite similar to exec-ing your script from an IDE wrapper that is the actual __name__ == "__main__" script. Even if the test still claims that __name__ == "__main__" in your script, when pickle checks sys.modules['__main__'] for your function, it ends up finding the skeleton module the IDE actually ran, not your "real" script, and things break. (This can also happen when running a multiprocessing script under a profiler: python -m cProfile myscript.py leaves cProfile, not myscript, as the sys.modules['__main__'] entry.)
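If you want to confirm what your IDE is doing, a tiny diagnostic like the following (an illustrative snippet; the exact output under Spyder will vary by version) shows which module pickle will actually treat as __main__ when you run it from the IDE versus from a plain terminal:

import sys

main_module = sys.modules['__main__']
print("__name__ seen by this file:", __name__)
print("module registered as __main__:", getattr(main_module, '__file__', main_module))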
To fix these issues:

1. Don't run multiprocessing scripts in an IDE (nor from an interactive terminal), period. Run them from the command line in an independent terminal (e.g. on Windows, cmd.exe or PowerShell, ideally running via the Windows launcher if it was installed as admin with its box checked, e.g. py -3 path/to/script.py).
2. Don't use exec. This is a good idea in general: it's a hacky way to do what importing would do for you more correctly, more concisely, and often more efficiently, since the cached import is only loaded once per process rather than re-run for every exec. There are use cases for exec, but they're very narrow, and it should never be the first, second, third, or even fourth tool you reach for.
3. Use the import guard consistently; when you used import, you didn't guard the invocation of run_mp, so launching the workers would try to run run_mp in each worker. This is a critical part of the multiprocessing programming guidelines on Windows (where 'spawn' is the default and only available start method), and it's a good idea to adhere to it elsewhere: macOS defaults to 'spawn' in recent versions of Python, with 'fork' available but dangerous due to non-fork-safe system libraries, and even on Linux/BSD some people choose the 'spawn' or 'forkserver' start methods when the parent process could be huge and the default 'fork' would consume a lot of memory, possibly crashing the process(es) or triggering the OOM killer through excessive virtual memory usage. (A sketch of setting the start method explicitly follows the final snippet below.)

In short, your final example usage was almost correct (if invoked from a separate script, not an interactive session, and not via an IDE). You just need to change it to:
from multiprocess import run_mp

if __name__ == "__main__":
    run_mp()
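As a follow-up to fix #3, if you want the same behaviour on every platform, you can also pin the start method explicitly in the guarded block. This is only a sketch, assuming run_mp itself is left unchanged ('spawn' is already the only option on Windows, so there it changes nothing):

import multiprocessing as mp
from multiprocess import run_mp

if __name__ == "__main__":
    # Use the portable 'spawn' start method everywhere so the script
    # behaves the same on Windows, macOS, and Linux.
    mp.set_start_method("spawn")
    run_mp()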