c multithreading performance parallel-processing openmp

OpenMP impact on performance

I am trying to parallelize a script using OpenMP, but when I measure its execution time (using omp_get_thread_num) the results are pretty odd:

if I set the number of threads to 2 it measures 4935 us
setting it to 1 takes around 1083 us
and removing every OpenMP directive turns that into only 9 us

Here's the part of the script I'm talking about (this loop is nested inside another one)

for (j = i - 1; j >= 0; j--) {
    a = 0;
    #pragma omp parallel
    {
        #pragma omp single
        {
            if (arreglo[j] > y) {
                arreglo[j+2] = arreglo[j];
            }
            else if (arreglo[j] > x) {
                if (!flag[1]) {
                    arreglo[j+2] = y;
                    flag[1] = 1;
                }
                arreglo[j+1] = arreglo[j];
            }
        }
        #pragma omp single
        {
            if (arreglo[j] <= x) {
                arreglo[j+1] = x;
                flag[0] = 1;
                a = 1;
            }
        }
        #pragma omp barrier
    }
    if (a == 1) { break; }
}

What could be the cause of these differences? Some sort of bottleneck, or is it just the added cost of synchronization?

Solution

We are talking about a really short execution time, which can easily be affected by the environment used for the benchmark;
You are clearly using an input size that does not justify the overhead of the parallelism;
Your current design only allows for 2 threads, so there is no room for scaling;
Instead of using the single construct, you might as well statically divide those two code branches based on the thread ID, which would save the overhead of the single construct (each single also ends with an implicit barrier unless nowait is given);
That last barrier is redundant, since #pragma omp parallel already has an implicit barrier at the end of it.
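
As a rough sketch of the "divide by thread ID" idea, assuming two pieces of work that really are independent (which the loop in the question may not be): the names sum_range and split_by_thread_id are hypothetical and only stand in for the two branches. The _OPENMP guard lets the same file build with or without OpenMP enabled.

```c
#include <assert.h>
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical stand-in for one branch: sums v[lo..hi). */
static long sum_range(const int *v, size_t lo, size_t hi)
{
    long s = 0;
    for (size_t k = lo; k < hi; k++) {
        s += v[k];
    }
    return s;
}

/* Gives each of two threads one fixed piece of work, chosen by thread ID,
 * instead of wrapping each piece in its own "single" construct. */
long split_by_thread_id(const int *v, size_t n)
{
    assert(v != NULL || n == 0);

    long part[2] = {0, 0};   /* each thread writes only its own slot */
    size_t mid = n / 2;

    #pragma omp parallel num_threads(2)
    {
#ifdef _OPENMP
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
#else
        int tid = 0;         /* built without OpenMP: one thread does everything */
        int nthreads = 1;
#endif
        if (tid == 0) {
            part[0] = sum_range(v, 0, mid);
        }
        /* The runtime may grant fewer threads than requested, so a lone
         * thread also has to cover the second piece. */
        if (tid == 1 || nthreads == 1) {
            part[1] = sum_range(v, mid, n);
        }
    }   /* implicit barrier here: no extra "#pragma omp barrier" needed */

    return part[0] + part[1];
}
```

For an array holding 1 to 5, split_by_thread_id(v, 5) gives the same sum as a plain loop; for inputs this small the sequential version should still be expected to win.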

Furthermore, your code just looks intrinsically sequential, and with the current design, the code is clearly not suitable for parallelism.

"if I set the number of threads to 2 it measures 4935 us, setting it to 1 takes around 1083 us, and removing every OpenMP directive turns that into only 9 us"

With 2 threads you are paying all that synchronization overhead; with 1 thread you are still paying the price of having OpenMP there, because a parallel region is entered on every loop iteration. Finally, without the parallelization, you removed all that overhead, hence the lower execution time.

By the way, you do not need to remove the OpenMP directives: just compile the code without the -fopenmp flag, and the directives will be ignored.
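
To see the overhead for yourself, something like the following might help. It is only a sketch: time_parallel_regions, time_plain_loop and now_seconds are made-up names, and the fallback to clock() (CPU time, not wall-clock time) when OpenMP is disabled is my assumption. Note that the pragmas are ignored without -fopenmp, but calls such as omp_get_wtime are not, so they need the _OPENMP guard.

```c
#include <time.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Wall-clock seconds with OpenMP; CPU seconds via clock() otherwise. */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Enters a parallel region `reps` times, doing almost no work in each one,
 * so the measured time is mostly region and barrier overhead.
 * Stores the number of increments performed in *count. */
double time_parallel_regions(int reps, long *count)
{
    long total = 0;
    double start = now_seconds();
    for (int r = 0; r < reps; r++) {
        #pragma omp parallel
        {
            #pragma omp single
            total += 1;      /* one thread increments, the others wait */
        }
    }
    double elapsed = now_seconds() - start;
    *count = total;
    return elapsed;
}

/* Same amount of work with no OpenMP constructs at all. */
double time_plain_loop(int reps, long *count)
{
    volatile long total = 0; /* volatile keeps the loop from being optimized away */
    double start = now_seconds();
    for (int r = 0; r < reps; r++) {
        total += 1;
    }
    double elapsed = now_seconds() - start;
    *count = total;
    return elapsed;
}
```

Building once with gcc -fopenmp and once without, then comparing the two returned times for the same reps, should show the same kind of gap as in the question, although the exact numbers will depend on the machine and the runtime.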