Tags: c, multithreading, parallel-processing, openmp, hpc

Parallel Binary Search performs worse than the serial version


I wrote the code below, which executes a number of binary searches in an array. I parallelized it with OpenMP, but it seems that the more threads I add, the longer it takes to finish. The program takes as arguments the length of the array on which the binary search is performed and the length of the search array, which holds the values to be looked up in the first array. The parallelization is applied to all three for loops.

I run this program on an HPC cluster, on a single node with 20 cores, with the following script:

for threads in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ; do
    export OMP_NUM_THREADS=${threads}
    ./binary_search_parallel.x 1000000000 100000000
done

My problem is that the program doesn't scale at all: the more threads I add, the more time it takes. The serial version performs far better. Does anybody know where the problem is? Or is there simply not enough work per lookup to amortize the parallel overhead?


#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <omp.h>

/* NOTE: CLOCK_PROCESS_CPUTIME_ID reports CPU time summed over ALL threads of
   the process, so it grows with the thread count even when wall-clock time
   shrinks; CLOCK_MONOTONIC or omp_get_wtime() measure wall-clock time. */
#define CPU_TIME (clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &ts ), (double)ts.tv_sec + \
          (double)ts.tv_nsec * 1e-9)


int mybsearch(int *data, int low, int high, int Key)
{
  while (low <= high) {
    int mid = low + (high - low) / 2;   /* avoids int overflow of (low + high) */

    if (data[mid] < Key)
      low = mid + 1;
    else if (data[mid] > Key)
      high = mid - 1;
    else
      return mid;
  }

  return -1;
}

#define N_DEFAULT  (1024*1024*128)
#define N_search_DEFAULT (N_DEFAULT / 10)

int main(int argc, char **argv)
{
  int N, Nsearch, i, n_threads = 1;
  int *data, *search;

  #ifndef _OPENMP
    printf("serial binary search\n");
  #else
  #pragma omp parallel
  {
    #pragma omp master
    {
      n_threads = omp_get_num_threads();
      printf("omp binary search with %d threads\n", n_threads );
    }
  }
  #endif

  if(argc > 1)
    N = atoi( *(argv+1) );
  else
    N = N_DEFAULT;

  if(argc > 2)
    Nsearch = atoi ( *(argv + 2) );
  else
    Nsearch = N_search_DEFAULT;

  printf("performing %d lookups on %d elements..\n", Nsearch, N);

  printf("set-up data.."); fflush(stdout);
  data = (int*)malloc(N * sizeof(int));

  #if defined(_OPENMP)
   #pragma omp parallel for
      for (i = 0; i < N; i++)
        data[i] = i;
  #else
    for(i = 0; i < N; i++)
      data[i] = i;
  #endif

  printf(" set-up lookups.. "); fflush(stdout);  
  search = (int*)malloc(Nsearch * sizeof(int));
  srand(time(NULL));

  #if defined(_OPENMP)
    #pragma omp parallel for
      for (i = 0; i < Nsearch; i++)
        search[i] = rand() % N;  /* NB: rand() is not thread-safe; rand_r() with per-thread seeds would be safer */
  #else
    for (i = 0; i < Nsearch; i++)  /* loop bound was N; must be Nsearch */
      search[i] = rand() % N;
  #endif


  int found = 0;
  double tstart, tstop;
  struct timespec ts;

  printf("\nstart cycle.. "); fflush(stdout);
  tstart = CPU_TIME;

  #if defined(_OPENMP)
    /* reduction avoids a data race on found; mybsearch takes (data, low, high, Key) */
    #pragma omp parallel for reduction(+:found)
      for (i = 0; i < Nsearch; i++)
        if( mybsearch(data, 0, N - 1, search[i]) >= 0)
          found++;
  #else
    for ( i = 0; i < Nsearch; i++)
      if(mybsearch(data, 0, N - 1, search[i]) >= 0)
        found++;
  #endif

  tstop = CPU_TIME;

  printf("time elapsed: %g\n", tstop - tstart);

  free(data);
  free(search);

  return 0;
 }

Solution

  • Are the 20 hardware threads all on the same socket? Does your machine have a NUMA (non-uniform memory access) architecture?

    If so, memory-access timing may be your bottleneck: on a NUMA machine you can pay a lot in execution time for badly placed memory, since you initialize the data in parallel.

    In tests with your code on a 48-core NUMA machine (8 NUMA nodes x 6 cores), it shows poor scalability if:

    1. You do not pin threads to cores (when using a number of threads less than or equal to the number of cores in a single socket);
    2. You use more than one NUMA memory bank. The memory access pattern of your code is very irregular, so the data being touched can live on any NUMA node.

    Here are some timings in seconds for the parameters 10000000 10000000:

    • Serial: ~6.57
    • Pinned (with taskset) serial: ~5.27
    • Parallel without pinning: timings per thread count shown in the figure
    • Pinned parallel (OMP_PLACES=cores OMP_PROC_BIND=close): timings per thread count shown in the figure

    [figure: execution time vs. thread count for the two parallel runs]

    You can notice that the time jumps each time a new NUMA node is included (at 7, 13, 19, 25, 31, 37 and 43 threads). The times are lower on average in the second parallel run than in the first because, thanks to thread pinning, we have some control over how many NUMA nodes we are using, reducing the chance that a thread migrates to a NUMA node far from where its data is actually located.