
OpenMPI - same rank on different procs


I've been working a bit with OpenMPI, and I'm not getting the expected behavior when querying the ranks of my processes.

I have a simple C program that is supposed to print each proc's rank :

minimal.c :

#include <stdio.h>
#include "mpi.h"

int
main (int argc, char *argv[])
{
    int procs;   /* MPI_Comm_size and MPI_Comm_rank expect int pointers */
    int self;
    MPI_Comm com;

    /* MPI init */
    MPI_Init (&argc, &argv);
    com = MPI_COMM_WORLD;
    MPI_Comm_size (com, &procs);
    MPI_Comm_rank (com, &self);

    printf("My rank is %d\n", self);

    /* MPI Finalize */
    MPI_Finalize();
    return 0;
}

which I compile with :

mpicc minimal.c -o minimal

Now, if I run the following command on my own computer :

mpirun -np 2 minimal

I get the following trace :

$ mpirun -np 2 minimal
My rank is 0
My rank is 0

which I found quite disconcerting.


So I kept digging through the mpirun manual and ended up printing additional information with -display-devel-map and -report-bindings. This is the trace I got :

$ mpirun -np 2 -display-devel-map -report-bindings minimal
 Data for JOB [53858,1] offset 0

 Mapper requested: NULL  Last mapper: round_robin  Mapping policy: BYCORE  Ranking policy: SLOT
 Binding policy: CORE:IF-SUPPORTED  Cpu set: NULL  PPR: NULL  Cpus-per-rank: 1
  Num new daemons: 0  New daemon starting vpid INVALID
  Num nodes: 1

 Data for node: UX31A     Launch id: -1   State: 2
  Daemon: [[53858,0],0]   Daemon launched: True
  Num slots: 2    Slots in use: 2 Oversubscribed: FALSE
  Num slots allocated: 2  Max slots: 0
  Username on node: NULL
  Num procs: 2    Next node_rank: 2
  Data for proc: [[53858,1],0]
      Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
      State: INITIALIZED  App_context: 0
      Locale: [BB/..]
      Binding: [BB/..]
  Data for proc: [[53858,1],1]
      Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
      State: INITIALIZED  App_context: 0
      Locale: [../BB]
      Binding: [../BB]
[UX31A:04861] MCW rank 1 bound to socket 0[core 1[hwt 0-1]]: [../BB]
[UX31A:04861] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/..]
My rank is 0
My rank is 0

which left me puzzled.

I am using Ubuntu 16.04 and the OpenMPI packages from the apt repos. My computer is an Asus UX31a.

I'd be very grateful if someone could give me some insight on what is happening here.

Thank you !


Solution

  • I finally found what was going on thanks to Gilles Gouaillardet !

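    Turns out I had the MPICH libraries installed alongside the Open MPI binaries ! A binary linked against MPICH but launched by Open MPI's mpirun typically starts each process as an independent single-process job, which is why both processes reported rank 0.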


    Here's what I did :

    1. Check which library was used inside my binary :

      $ ldd minimal
      ...
      libmpich.so.12 => /usr/lib/x86_64-linux-gnu/libmpich.so.12
      ...

      $ dpkg -S /usr/lib/x86_64-linux-gnu/libmpich.so.12
      libmpich12:amd64: /usr/lib/x86_64-linux-gnu/libmpich.so.12.1.0

    2. Check which package provided my mpicc and mpirun binaries :

      $ which mpirun
      /usr/bin/mpirun

      $ dpkg -S mpirun
      openmpi-bin: /usr/bin/mpirun.openmpi
      ...

    3. I removed the mpich packages I had installed

      sudo apt-get remove libmpich12 libmpich-dev

    4. I installed the openmpi libraries I needed

      sudo apt-get install libopenmpi-dev
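
The check in step 1 boils down to spotting which MPI implementation's shared library shows up in the ldd output. A minimal sketch of that classification in C follows. The function name mpi_flavor_from_ldd_line is hypothetical, and the sketch assumes that MPICH-linked binaries reference a library whose name contains "libmpich" while Open MPI-linked binaries reference "libopen-pal" or "libopen-rte"; other installations may name things differently.

```c
#include <string.h>

/*
 * Hypothetical helper (not from the original post): classify one line of
 * `ldd` output by the MPI implementation it points to.
 *
 * Assumption: MPICH binaries link a library named libmpich*, and Open MPI
 * binaries link libopen-pal and/or libopen-rte. A plain "libmpi.so" is not
 * checked because both implementations may ship a library with that name.
 */
const char *mpi_flavor_from_ldd_line(const char *line)
{
    if (strstr(line, "libmpich") != NULL)
        return "MPICH";
    if (strstr(line, "libopen-pal") != NULL ||
        strstr(line, "libopen-rte") != NULL)
        return "Open MPI";
    return "unknown";
}
```

Fed the ldd line shown in step 1, this helper would return "MPICH", which conflicts with an mpirun owned by the openmpi-bin package.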


    I compiled again once this was done :

    $ mpicc minimal.c -o minimal
    $ mpirun -np 2 minimal
    My rank is 0
    My rank is 1
    

    Hurray !