This is a continuation of the thread ipython-with-mpi-clustering-using-machinefile. It is slightly more focused and hopefully clearer as to what the issue might be.
I have 3 nodes running as a cluster using mpich/mpi4py, a machinefile, and all libraries in a virtualenv, all on an NFS share. My goal is to use ipython/ipyparallel to distribute jobs across multiple nodes, each running multiple ipython engines.
I am able to run ipcluster start --profile=mpi -n 4 on one node (in this case, worker2), and from another node (in this case, worker1) run ipython --profile=mpi and list the running engines using the following commands:
import ipyparallel as ipp

client = ipp.Client()
dview = client[:]

with dview.sync_imports():
    import socket

@dview.remote(block=True)
def engine_hostname():
    return socket.gethostname()

results = engine_hostname()
for r in results:
    print r
As expected, I get the hostname of the host running the engines printed 4 times:
In [7]: for r in results:
   ...:     print r
   ...:
worker2
worker2
worker2
worker2
However, if I start ipcluster on another node (in this case head), then those are the only engines/nodes that show up when I query them as outlined above, even though the first set of engines is still running on the other node:
In [7]: for r in results:
   ...:     print r
   ...:
head
head
head
head
My question is: how can I get ipython to see all of the engines running on all of the nodes; in other words, how do I actually distribute the load across the different nodes?
Running mpi on its own works fine (head, worker1, and worker2 are the respective nodes in the cluster):
(venv)gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
head[21506]: 0/10
worker1[7809]: 1/10
head[21507]: 3/10
worker2[8683]: 2/10
head[21509]: 9/10
worker2[8685]: 8/10
head[21508]: 6/10
worker1[7811]: 7/10
worker2[8684]: 5/10
worker1[7810]: 4/10
So, at least I know this is not the problem.
Resolved. I recreated my ipcluster_config.py file, added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it, and this time it worked, for some bizarre reason. I could swear I had this in it before, but alas...
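For anyone hitting the same issue, the relevant part of ipcluster_config.py ends up looking something like this (a sketch, not verified against every ipyparallel version; the machinefile path is a placeholder, and the engine_launcher_class line is only needed if the profile isn't already configured to launch engines via MPI):

```python
# ipcluster_config.py, in the profile directory (e.g. ~/.ipython/profile_mpi/)
c = get_config()  # provided by IPython when this config file is loaded

# Launch engines via MPI so they can be spread across nodes
c.IPClusterEngines.engine_launcher_class = 'MPI'

# Pass the machinefile through to mpiexec so engines start on all listed hosts
c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"]
```

With this in place, ipcluster start --profile=mpi -n 10 should start engines across all hosts listed in the machinefile rather than only on the local node.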