mpi, distributed-computing, openmpi

Using rankfiles with OpenMPI


I am trying to use MPI on a cluster and would like to control which ranks are scheduled on which nodes.

Note: I am using OpenMPI 2.1.0.

For that I am using a rankfile. If I use the following rankfile:

ubuntu@ip-172-31-8-16:~/dist_log_reg$ cat rankfile 
rank 0=localhost slots=1
rank 1=54.153.103.12 slots=1

I get:

ubuntu@ip-172-31-8-16:~/dist_log_reg$ mpirun -v -np 1 -rankfile rankfile hostname
--------------------------------------------------------------------------
The rankfile that was used claimed that a host was either not
allocated or oversubscribed its slots.  Please review your rank-slot
assignments and your host allocation to ensure a proper match.  Also,
some systems may require using full hostnames, such as
"host1.example.com" (instead of just plain "host1").

  Host: ip-172-31-8-16

If I use only one entry in the rankfile:

ubuntu@ip-172-31-8-16:~/dist_log_reg$ cat rankfile 
rank 0=localhost slots=1

I get:

ubuntu@ip-172-31-8-16:~/dist_log_reg$ mpirun -v -np 1 -rankfile rankfile hostname
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.

I have tried everything I can think of (e.g., installing other distributions of MPI and trying different options in the rankfile) but haven't been able to make this work.

Any ideas?


Solution

  • I managed to reproduce your error by passing localhost as the hostname. When I used the actual system name instead, it ran:

    rank X=myPC slot=Y
    

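    I believe Open MPI looks up the local node's name with a gethostname call and compares the rankfile entries against it, so localhost does not match. That would be consistent with the error message reporting Host: ip-172-31-8-16 rather than localhost.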
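
To avoid typing the system name by hand, the rankfile can be generated from the name the machine itself reports. The following is only a minimal sketch: `build_rankfile` and `local_hostname` are hypothetical helper names, `slot=0` is an assumed placeholder value, and the second host is the address taken from the question. It uses the `slot=` keyword as in the example above, and Python's `socket.gethostname()` to obtain the local name.

```python
import socket


def local_hostname():
    """Return the name this machine reports for itself."""
    return socket.gethostname()


def build_rankfile(hosts, slot=0):
    """Return rankfile text with one 'rank N=host slot=S' line per host.

    Ranks are numbered in list order. The slot value is an assumed
    placeholder; adjust it to the cores you actually want to bind to.
    """
    lines = []
    for rank, host in enumerate(hosts):
        lines.append(f"rank {rank}={host} slot={slot}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Rank 0 on this machine (by its real name, not localhost),
    # rank 1 on the remote address from the question.
    hosts = [local_hostname(), "54.153.103.12"]
    with open("rankfile", "w") as f:
        f.write(build_rankfile(hosts))
    print(build_rankfile(hosts), end="")
```

Running the script writes a file named rankfile in the current directory, which can then be passed to mpirun with -rankfile rankfile as in the question. The remote entries still have to match the names in your host allocation.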