Search code examples
cperformanceparallel-processingmpireduction

An efficient way to perform an all reduction in MPI of a value based on another variable?


As an example, lets say I have

int a = ...;
int b = ...;
int c;

where a is the result of some complex local calculation and b is some metric for the quality of a.

I'd like to send the best value of a to every process and store it in c where best is defined by having the largest value of b.

I guess I'm just wondering if there is a more efficient way of doing this than doing an allgather on a and b and then searching through the resulting arrays.

The actual code involves sending and comparing several hundred values on upto several hundred/thousand processes, so any efficiency gains would be welcome.


Solution

  • I guess I'm just wondering if there is a more efficient way of doing this than doing an allgather on a and b and then searching through the resulting arrays.

    This can be achieved with only a single MPI_AllReduce.

    I will present two approaches, a simpler one (suitable for your use case); and a more generic one, for more complex use-cases. The latter will also be useful to show case MPI functionality such as custom MPI Datatypes and custom MPI reduction operators.

    Approach 1

    To represent

    int a = ...;
    int b = ...;
    

    you could use the following struct:

    typedef struct MyStruct {
        int b;
        int a;
    } S;
    

    then you can use the MPI Datatype MPI_2INT and the MPI operator MAXLOC

    The operator MPI_MINLOC is used to compute a global minimum and also an index attached to the minimum value. **MPI_MAXLOC similarly computes a global maximum and index. One application of these is to compute a global minimum (maximum) and the rank of the process containing this value.

    In your case, instead of the rank we will be using the value of 'a'. Hence, the MPI_AllReduce call:

     S  local, global;
     ...
     MPI_Allreduce(&local, &global, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);
    

    The complete code would look like the following:

    #include <stdio.h>
    #include <mpi.h>
    
    typedef struct MyStruct {
        int b;
        int a;
    } S;
    
    
    int main(int argc,char *argv[]){
        MPI_Init(NULL,NULL); // Initialize the MPI environment
        int world_rank; 
        int world_size;
        MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);
        MPI_Comm_size(MPI_COMM_WORLD,&world_size);
        
        // Some fake data
        S local, global;
        local.a = world_rank;
        local.b = world_size - world_rank;
    
        MPI_Allreduce(&local, &global, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);
              
        if(world_rank == 0){
          printf("%d %d\n", global.b, global.a);
        }
    
        MPI_Finalize();
        return 0;
     }
    

    Second Approach

    The MPI_MAXLOC only works for a certain number of predefined datatypes. Nonetheless, for the remaining cases you can use the following approach (based on this SO thread):

    1. Create a struct that will contain the values a and b;
    2. Create a customize MPI_Datatype representing the 1. struct to be sent across processes;
    3. Use MPI_AllReduce:

    int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

    Combines values from all processes and distributes the result back to all processes

    1. Use the operation MAX;

    I'd like to send the best value of 'a' to every process and store it in 'c' where best is defined by having the largest value of 'b'.

    1. Then you have to tell MPI to only consider the element b of the struct. Hence, you need to create a custom MPI_Op max operation.

    Coding the approach

    So let us break step-by-step the aforementioned implementation:

    First define the struct:

    typedef struct MyStruct {
        double a, b;
    } S;
    

    Second create the customize MPI_Datatype:

    void defineStruct(MPI_Datatype *tstype) {
        const int count = 2;
        int          blocklens[count];
        MPI_Datatype types[count];
        MPI_Aint     disps[count];
    
        for (int i=0; i < count; i++){
            types[i] = MPI_DOUBLE;
            blocklens[i] = 1;
        }
        disps[0] = offsetof(S,a);
        disps[1] = offsetof(S,b);
    
        MPI_Type_create_struct(count, blocklens, disps, types, tstype);
        MPI_Type_commit(tstype);
    }
    

    Very Important Note that since we are using a struct you have to be careful with the fact that (source)

    the C standard allows arbitrary padding between the fields.

    So reducing a struct with two doubles is NOT the same as reducing an array with two doubles.

    In the main you have to do:

    MPI_Datatype structtype;
    defineStruct(&structtype);
    

    Third create the custom max operation:

    void max_struct(void *in, void *inout, int *len, MPI_Datatype *type){
        S *invals    = in;
        S *inoutvals = inout;
        for (int i=0; i < *len; i++)
            inoutvals[i].b  = (inoutvals[i].b > invals[i].b) ? inoutvals[i].b  : invals[i].b;
    }
    

    in the main do:

    MPI_Op       maxstruct;
    MPI_Op_create(max_struct, 1, &maxstruct);
    

    Finally, call the MPI_AllReduce:

    S local, global;
    ...
    MPI_Allreduce(&local, &global, 1, structtype, maxstruct, MPI_COMM_WORLD); 
    

    The entire code put together:

    #include <assert.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>
    
    typedef struct MyStruct {
        double a, b;
    } S;
    
    void max_struct(void *in, void *inout, int *len, MPI_Datatype *type){
        S *invals    = in;
        S *inoutvals = inout;
        for (int i=0; i<*len; i++)
            inoutvals[i].b  = (inoutvals[i].b > invals[i].b) ? inoutvals[i].b  : invals[i].b;
    }
    
    void defineStruct(MPI_Datatype *tstype) {
        const int count = 2;
        int          blocklens[count];
        MPI_Datatype types[count];
        MPI_Aint     disps[count];
    
        for (int i=0; i < count; i++) {
            types[i] = MPI_DOUBLE;
            blocklens[i] = 1;
        }
        disps[0] = offsetof(S,a);
        disps[1] = offsetof(S,b);
    
        MPI_Type_create_struct(count, blocklens, disps, types, tstype);
        MPI_Type_commit(tstype);
    }
    
    int main(int argc,char *argv[]){
        MPI_Init(NULL,NULL); // Initialize the MPI environment
        int world_rank; 
        int world_size;
        MPI_Comm_rank(MPI_COMM_WORLD,&world_rank);
        MPI_Comm_size(MPI_COMM_WORLD,&world_size);
        MPI_Datatype structtype;
        MPI_Op       maxstruct;
        S  local, global;
    
        defineStruct(&structtype);
        MPI_Op_create(max_struct, 1, &maxstruct);
    
        // Just some random values
        local.a = world_rank;
        local.b = world_size - world_rank;
    
        MPI_Allreduce(&local, &global, 1, structtype, maxstruct, MPI_COMM_WORLD);  
              
        if(world_rank == 0){
          double c = global.a;
          printf("%f %f\n", global.b, c);
        }
    
        MPI_Finalize();
        return 0;
     }