I wrote the code below, which executes a certain number of binary searches in an array. I parallelized it with OpenMP, but the more threads I add, the longer it takes to finish.
The program takes as arguments the length of the array in which the binary search is performed and the length of the search array, which holds the values to be looked up in the first array. The parallelization is applied to all three for loops.
I run this program on an HPC cluster, on a single node with 20 cores, with the following script:
for threads in 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ; do
export OMP_NUM_THREADS=${threads}
./binary_search_parallel.x 1000000000 100000000
done
My problem is that the program doesn't scale at all: the more threads I add, the more time it takes. The serial version performs far better. Does anybody know where the problem is? Or is it simply that there isn't enough work per thread to amortize the parallel overhead?
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <string.h>
#include <omp.h>
/* process CPU time, i.e. the CPU time summed over all threads of the process */
#define CPU_TIME (clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &ts ), (double)ts.tv_sec + \
                  (double)ts.tv_nsec * 1e-9)
int mybsearch(int *data, int N, int Key)
{
  int low = 0, high = N - 1, mid;

  mid = low + (high - low) / 2;   /* written this way to avoid int overflow for large N */
  while (low <= high) {
    if (data[mid] < Key)
      low = mid + 1;
    else if (data[mid] > Key)
      high = mid - 1;
    else
      return mid;
    mid = low + (high - low) / 2;
  }

  /* if ( Key == data[low] ) */
  /*   return 0; */
  /* else */
  return -1;
}
#define N_DEFAULT (1024*1024*128)
#define N_search_DEFAULT (N_DEFAULT / 10)
int main(int argc, char **argv)
{
  int N, Nsearch, i, n_threads = 1;
  int *data, *search;

#ifndef _OPENMP
  printf("serial binary search\n");
#else
#pragma omp parallel
  {
#pragma omp master
    {
      n_threads = omp_get_num_threads();
      printf("omp binary search with %d threads\n", n_threads);
    }
  }
#endif
  if (argc > 1)
    N = atoi(*(argv + 1));
  else
    N = N_DEFAULT;

  if (argc > 2)
    Nsearch = atoi(*(argv + 2));
  else
    Nsearch = N_search_DEFAULT;

  printf("performing %d lookups on %d data..\n", Nsearch, N);
  printf("set-up data.."); fflush(stdout);
  data = (int*)malloc(N * sizeof(int));

#if defined(_OPENMP)
  /* parallel first touch: each page of data[] ends up on the NUMA
     node of the thread that initializes it */
#pragma omp parallel for
  for (i = 0; i < N; i++)
    data[i] = i;
#else
  for (i = 0; i < N; i++)
    data[i] = i;
#endif
printf(" set-up lookups.. "); fflush(stdout);
search = (int*)malloc(Nsearch * sizeof(int));
srand(time(NULL));
#if defined(_OPENMP)
#pragma omp parallel for
for (i = 0; i < Nsearch; i++)
search[i] = rand() % N;
#else
for (i = 0; i < N; i++)
search[i] = rand() % N;
#endif
  int found = 0;
  double tstart, tstop;
  struct timespec ts;

  printf("\nstart cycle.. "); fflush(stdout);
  tstart = CPU_TIME;

#if defined(_OPENMP)
  /* reduction avoids a data race on found */
#pragma omp parallel for reduction(+:found)
  for (i = 0; i < Nsearch; i++)
    if (mybsearch(data, N, search[i]) >= 0)
      found++;
#else
  for (i = 0; i < Nsearch; i++)
    if (mybsearch(data, N, search[i]) >= 0)
      found++;
#endif

  tstop = CPU_TIME;
  printf("time elapsed: %g\n", tstop - tstart);

  free(data);
  free(search);
  return 0;
}
Are the 20 hardware threads on the same socket? Does your machine have a NUMA (non-uniform memory access) architecture?
Maybe this is your bottleneck: the cost of the memory accesses. On a NUMA machine you can pay a lot in execution time for bad memory placement, because you initialize the data in parallel and, under the usual first-touch policy, each page lands on the NUMA node of the thread that first writes it.
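If you are not sure about the topology, you can check it on the node, for example (assuming numactl is installed there):

lscpu | grep -i numa
numactl --hardware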
In tests with your code on a 48-core NUMA machine (8 NUMA nodes x 6 cores), it also shows poor scalability once the threads spill over onto additional NUMA nodes.
Here are some timings in seconds for the parameters 10000000 10000000, run with:
OMP_PLACES=cores OMP_PROC_BIND=close
You can notice that there is an increase in time each time a new NUMA node is included (at 7, 13, 19, 25, 31, 37, and 43 threads). The times are lower on average for the second parallel solution than for the first because, in the second solution, thread pinning gives some control over how many NUMA nodes are used, reducing the chance that a thread migrates to a NUMA node far from where the data is actually located.
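For reference, a pinned variant of your run script might look like this (the thread counts are just an example, chosen to line up with the 6-core NUMA nodes of my machine):

export OMP_PLACES=cores
export OMP_PROC_BIND=close
for threads in 1 6 12 24 48 ; do
export OMP_NUM_THREADS=${threads}
./binary_search_parallel.x 1000000000 100000000
done

With OMP_PROC_BIND=close the threads are packed onto as few cores (and therefore as few NUMA nodes) as possible, and they cannot migrate once bound.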