There are modules suited for multiprocessing on clusters, listed here. But I have a script that already uses the multiprocessing
module. This answer states that using this module on a cluster only lets it create processes within a single node. What does that behavior look like in practice?
Let's say I have a script called multi.py which looks something like this:
import multiprocessing as mp

output = mp.Queue()

def square(num, output):
    """Example function: square num."""
    res = num**2
    output.put(res)

processes = [mp.Process(target=square, args=(x, output)) for x in range(100000)]

# Run processes
for p in processes:
    p.start()

# Get the results from the output queue
# (drain the queue before joining: with this many items, joining first
# can deadlock because workers block while the queue is full)
results = [output.get() for p in processes]

# Exit the completed processes
for p in processes:
    p.join()

print(results)
And I would submit this script to a cluster (for example Sun Grid Engine):
#!/bin/bash
# this script is called run.sh
python multi.py
and submit it with qsub:
qsub -q short -lnodes=1:ppn=4 run.sh
What would happen? Will Python create processes only within the boundary specified in the qsub command (only on 4 CPUs), or will it try to use every CPU on the node?
Your qsub call gives you 4 processors per node, on 1 node. Thus multiprocessing will be limited to using at most 4 processors.
BTW, if you want to do hierarchical parallel computing (across multiple clusters using sockets or SSH, with MPI in coordination with cluster schedulers, and with multiprocessing and threading), you might want to have a look at pathos and its sister package pyina (which interacts with MPI and the cluster scheduler).
For example, see: https://stackoverflow.com/questions/28203774/how-to-do-hierarchical-parallelism-in-ipython-parallel
Get pathos here: https://github.com/uqfoundation