
Accessing multiple nodes in an MPI cluster using IPython


This is a continuation of the thread ipython-with-mpi-clustering-using-machinefile. It is slightly more focused and hopefully clearer as to what the issue might be.

I have 3 nodes running as a cluster using mpich/mpi4py, with a machinefile and all libraries in a virtualenv, all on an NFS share. My goal is to use ipython/ipyparallel to distribute jobs across multiple nodes, each running multiple ipython engines.

I am able to run ipcluster start --profile=mpi -n 4 on one node (in this case, worker2), and from another node (in this case worker1) run ipython --profile=mpi and list the engines running on that node using the following commands:

import ipyparallel as ipp 

client = ipp.Client()
dview  = client[:]

with dview.sync_imports():
    import socket

@dview.remote(block=True)
def engine_hostname():
    return socket.gethostname()

results = engine_hostname()
for r in results:
    print r

As expected, I get 4 instances of the hostname of the host running the engines printed:

In [7]: for r in results:
        print r
   ...:
worker2
worker2
worker2
worker2
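
As a quick sanity check (assuming the same client object from the snippet above), the list of engine IDs the controller knows about can also be inspected directly via the client's ids attribute:

# engine IDs currently registered with the controller
print(client.ids)        # e.g. [0, 1, 2, 3] when 4 engines are connected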

However, if I start ipcluster on another node (in this case head), then those are the only engines/nodes that show up when I query them as outlined above, even though the first set of engines is still running on the other node:

In [7]: for r in results:
        print r
   ...:
head
head
head
head

My question is: how can I get IPython to see all of the engines on all of the running nodes; in other words, how can I actually distribute the load across the different nodes?

Running MPI on its own works fine (head, worker1 and worker2 are the respective nodes in the cluster):

(venv)gms@head:~/development/mpi$ mpiexec -f machinefile -n 10 ipython test.py
head[21506]: 0/10
worker1[7809]: 1/10
head[21507]: 3/10
worker2[8683]: 2/10
head[21509]: 9/10
worker2[8685]: 8/10
head[21508]: 6/10
worker1[7811]: 7/10
worker2[8684]: 5/10
worker1[7810]: 4/10

So, at least I know this is not the problem.
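
For reference, test.py is not shown above; a minimal sketch that would produce output in that hostname[pid]: rank/size format with mpi4py (the exact script is an assumption) looks like this:

from mpi4py import MPI
import os
import socket

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # rank of this process
size = comm.Get_size()   # total number of MPI processes

# print hostname[pid]: rank/size, matching the output shown above
print("%s[%d]: %d/%d" % (socket.gethostname(), os.getpid(), rank, size))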


Solution

  • Resolved. I recreated my ipcluster_config.py file and added c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"] to it, and this time it worked, for some bizarre reason. I could swear I had this in it before, but alas...
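
    For context, a minimal sketch of the relevant part of ipcluster_config.py (assuming the MPI profile lives at ~/.ipython/profile_mpi/ and that the engines are launched via the standard MPI launcher) looks like this; the machinefile path is the placeholder from above:

    # ~/.ipython/profile_mpi/ipcluster_config.py
    c = get_config()

    # launch the engines through mpiexec
    c.IPClusterEngines.engine_launcher_class = 'MPI'

    # hand mpiexec the machinefile so engines are spread across all nodes
    c.MPILauncher.mpi_args = ["-machinefile", "path_to_file/machinefile"]

    With that in place, ipcluster start --profile=mpi -n 4 should start engines across the hosts listed in the machinefile rather than only on the local node.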