Tags: mpi, openmpi

MPI_Wtime timer runs about 2 times faster in OpenMPI 2.0.2


After updating OpenMPI from 1.8.4 to 2.0.2, I ran into erroneous time measurements using MPI_Wtime(). With version 1.8.4 the result was the same as that returned by the omp_get_wtime() timer; now MPI_Wtime() runs about 2 times faster.

What can cause such behaviour?

My sample code:

#include <omp.h>
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

// Artificial workload: repeatedly shift the array so the call takes a few seconds.
int some_work(int rank, int tid){
  int count = 10000;
  int arr[count];
  for( int i=0; i<count; i++)
    arr[i] = i + tid + rank;
  for( int val=0; val<4000000; val++)
    for(int i=0; i<count-1; i++)
      arr[i] = arr[i+1];

  return arr[0];
}


int main (int argc, char *argv[]) {

  MPI_Init(NULL, NULL);
  int rank, size;

  MPI_Comm_size(MPI_COMM_WORLD, &size);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)
    printf("there are %d mpi processes\n", size);

  MPI_Barrier(MPI_COMM_WORLD);

  double omp_time1 = omp_get_wtime();
  double mpi_time1 = MPI_Wtime();
  // Only thread 0 of each process runs the workload.
  #pragma omp parallel
  {
    int tid = omp_get_thread_num();
    if ( tid == 0 ) {
      int nthreads = omp_get_num_threads();
      printf("There are %d threads for process %d\n", nthreads, rank);
      int result = some_work(rank, tid);
      printf("result for process %d thread %d is %d\n", rank, tid, result);
    }
  }

  MPI_Barrier(MPI_COMM_WORLD);
  double mpi_time2 = MPI_Wtime();
  double omp_time2 = omp_get_wtime();
  printf("process %d omp time: %f\n", rank, omp_time2 - omp_time1);
  printf("process %d mpi time: %f\n", rank,  mpi_time2 - mpi_time1);
  printf("process %d ratio: %f\n", rank, (mpi_time2 - mpi_time1)/(omp_time2 - omp_time1) );

  MPI_Finalize();

  return EXIT_SUCCESS;
}

Compiling

g++ -O3 src/example_main.cpp -o bin/example -fopenmp -I/usr/mpi/gcc/openmpi-2.0.2/include -L /usr/mpi/gcc/openmpi-2.0.2/lib -lmpi

And running

salloc -N2 -n2 mpirun --map-by ppr:1:node:pe=16 bin/example 

Gives something like

there are 2 mpi processes
There are 16 threads for process 0
There are 16 threads for process 1
result for process 1 thread 0 is 10000
result for process 0 thread 0 is 9999
process 1 omp time: 5.066794
process 1 mpi time: 10.098752
process 1 ratio: 1.993125
process 0 omp time: 5.066816
process 0 mpi time: 8.772390
process 0 ratio: 1.731342

The ratio is not consistently 2 as I first wrote, but it is still large.
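To narrow down which timer is skewed, a standalone cross-check of MPI_Wtime() against clock_gettime(CLOCK_MONOTONIC) over a fixed sleep could help. This is a minimal sketch, separate from the example above:

#include <mpi.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
  MPI_Init(NULL, NULL);

  struct timespec ts1, ts2;
  clock_gettime(CLOCK_MONOTONIC, &ts1);
  double mpi1 = MPI_Wtime();

  sleep(5);   // known wall-clock interval

  double mpi2 = MPI_Wtime();
  clock_gettime(CLOCK_MONOTONIC, &ts2);

  double mono = (ts2.tv_sec - ts1.tv_sec) + (ts2.tv_nsec - ts1.tv_nsec) / 1e9;
  printf("monotonic: %f  MPI_Wtime: %f  ratio: %f\n",
         mono, mpi2 - mpi1, (mpi2 - mpi1) / mono);

  MPI_Finalize();
  return 0;
}

A ratio well above 1 here would point at MPI_Wtime() rather than at omp_get_wtime().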

Results for OpenMPI 1.8.4 are OK:

g++ -O3 src/example_main.cpp -o bin/example -fopenmp -I/usr/mpi/gcc/openmpi-1.8.4/include -L /usr/mpi/gcc/openmpi-1.8.4/lib -lmpi -lmpi_cxx

Gives

result for process 0 thread 0 is 9999
result for process 1 thread 0 is 10000
process 0 omp time: 4.655244
process 0 mpi time: 4.655232
process 0 ratio: 0.999997
process 1 omp time: 4.655335
process 1 mpi time: 4.655321
process 1 ratio: 0.999997

Solution

  • I've got similar behaviour on my cluster (same OpenMPI version as yours, 2.0.2), and the problem was the default governor for the CPU frequencies, the 'conservative' one. Once the governor was set to 'performance', the output of MPI_Wtime() aligned with the correct timings (the output of 'time', in my case). It appears that, on some older Xeon processors (like the Xeon E5620), some clocking function becomes skewed when overly aggressive dynamic frequency scaling policies are used; the same OpenMPI version does not suffer from this problem on newer Xeons in the same cluster. Example commands for checking and changing the governor are below.
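    For reference, on a typical Linux system the active governor can be inspected and switched roughly like this (exact tool and sysfs paths vary by distribution, and cpupower may need to be installed separately):

    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    sudo cpupower frequency-set -g performance

    On a cluster this has to be applied on every compute node, which usually requires the administrator.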