
Speedup and scheduling with OpenMP


I'm using OpenMP for a kNN project. The two parallelized for loops are:

#pragma omp parallel for
for(int ii=0;ii<sizeTest;ii++){
    for(int t = 0 ; t<sizeTraining ; t++){
        distance[ii][t]=EuclideanDistance(testSet[ii],trainingSet[t]);
    }
}

and

#pragma omp parallel for
for(int ii=0;ii<sizeTest;ii++){          
   classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii],k,sizeTraining);
}

I tried different combinations of scheduling, and these are the results:

Serial: 1020 sec

Static (default chunk size) - 4 threads: 256.28 sec

Dynamic (default chunk size = 1) - 4 threads: 256.27 sec
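
Concretely, the scheduling variants were selected with a schedule clause on the first loop above, like so:

// Static: the iteration range is split into contiguous blocks,
// assigned to the threads once, up front.
#pragma omp parallel for schedule(static)
for(int ii=0; ii<sizeTest; ii++)
    for(int t = 0; t<sizeTraining; t++)
        distance[ii][t] = EuclideanDistance(testSet[ii], trainingSet[t]);

// Dynamic: each thread grabs the next chunk (default size 1) from a
// shared counter whenever it finishes its current one.
#pragma omp parallel for schedule(dynamic)
for(int ii=0; ii<sizeTest; ii++)
    for(int t = 0; t<sizeTraining; t++)
        distance[ii][t] = EuclideanDistance(testSet[ii], trainingSet[t]);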

I expected static to be the best, since the iterations take approximately the same time, while dynamic would introduce too much overhead. This does not seem to happen, and I can't understand why. Furthermore, with static scheduling the speedup looks linear except in the 16-thread case:

Serial: 1020 sec

2 threads: 511.35 sec

4 threads: 256.28 sec

8 threads: 128.66 sec

16 threads: 90.98 sec

24 threads: 61.4 sec

Why does the 16-thread case differ so much from the others? I'm running the algorithm on a Google VM with 24 threads and 96 GB of RAM. Thanks to all.


Solution

  • Why does the 16-thread case differ so much from the others? I'm running the algorithm on a Google VM with 24 threads and 96 GB of RAM.

    As you mentioned in the comments:

    It's an Intel Xeon CPU @ 2.30 GHz with 12 physical cores

    That is the reason scaling stops being (almost) linear when you move to 16 threads: you are no longer using only physical cores but also logical cores (i.e., hyper-threading).
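
    You can see this in the timings themselves. A quick sketch that computes speedup and parallel efficiency from the numbers reported in the question:

    #include <stdio.h>

    int main(void) {
        const double serial = 1020.0;               /* serial runtime in seconds */
        const int    threads[] = {2, 4, 8, 16, 24};
        const double time_s[]  = {511.35, 256.28, 128.66, 90.98, 61.4};

        for (int i = 0; i < 5; i++) {
            double speedup    = serial / time_s[i];   /* T_serial / T_parallel */
            double efficiency = speedup / threads[i]; /* speedup per thread */
            printf("%2d threads: speedup %5.2fx, efficiency %3.0f%%\n",
                   threads[i], speedup, efficiency * 100.0);
        }
        return 0;
    }

    Efficiency stays around 99% up to 8 threads, then drops to roughly 70% at 16 and 24 threads, i.e., exactly once the thread count exceeds the 12 physical cores.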

    I expected static to be the best, since the iterations take approximately the same time, while dynamic would introduce too much overhead.

    Most of the overhead of the dynamic distribution comes from the locking step the threads perform to acquire the next iteration to work on. It looks to me as if there is not much locking contention going on, and even if there is, it is being compensated by the better load balancing that the dynamic scheduler achieves. I have seen this exact pattern before; there is nothing wrong with it.
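
    If that locking overhead ever did become measurable, a common mitigation is to hand out work in larger chunks so each thread locks the scheduler less often. A minimal sketch (the chunk size of 32 is an arbitrary illustration and would need tuning for your problem):

    // Each thread acquires 32 iterations per locking step instead of 1,
    // trading some load-balancing granularity for less scheduler contention.
    #pragma omp parallel for schedule(dynamic, 32)
    for(int ii=0; ii<sizeTest; ii++){
        classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii], k, sizeTraining);
    }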

    As a side note, you can transform your code into:

    #pragma omp parallel
    {
        #pragma omp for
        for(int ii=0; ii<sizeTest; ii++)
            for(int t = 0 ; t<sizeTraining ; t++)
                distance[ii][t]=EuclideanDistance(testSet[ii],trainingSet[t]);

        #pragma omp for nowait
        for(int ii=0; ii<sizeTest; ii++){
            classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii],k,sizeTraining);
        }
    }


    This avoids having two parallel regions. Moreover, depending on your code, you might be able to further optimize it by adding a nowait clause to both #pragma omp for directives, which would remove one more implicit barrier. Nonetheless, you need to ensure that this will not result in race conditions; see the sketch below.
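
    A minimal sketch of that nowait variant, assuming both loops use the same schedule(static) over the same iteration range; the OpenMP runtime then assigns identical ii values to each thread in both loops, so every thread only reads the distance[ii] rows it wrote itself:

    #pragma omp parallel
    {
        // nowait removes the barrier between the two loops; this is safe
        // only because both loops share the same static schedule and trip
        // count, so each thread consumes exactly the rows it produced.
        #pragma omp for schedule(static) nowait
        for(int ii=0; ii<sizeTest; ii++)
            for(int t = 0 ; t<sizeTraining ; t++)
                distance[ii][t]=EuclideanDistance(testSet[ii],trainingSet[t]);

        #pragma omp for schedule(static) nowait
        for(int ii=0; ii<sizeTest; ii++){
            classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii], k, sizeTraining);
        }
    }

    The implicit barrier at the end of the parallel region still guarantees all work is finished before the region is left.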