i'm using OpenMP for a kNN project. The two parallelized for loops are:
#pragma omp parallel for
for (int ii = 0; ii < sizeTest; ii++) {
    for (int t = 0; t < sizeTraining; t++) {
        distance[ii][t] = EuclideanDistance(testSet[ii], trainingSet[t]);
    }
}
and
#pragma omp parallel for
for (int ii = 0; ii < sizeTest; ii++) {
    classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii], k, sizeTraining);
}
I tried different combinations of scheduling, and these are the results:
Serial: 1020 sec
Static (default chunk size) - 4 Threads = 256.28 sec
Dynamic (default chunk size = 1) - 4 Threads = 256.27 sec
I expected static to be the best, since the iterations take approximately the same time, while dynamic would introduce too much overhead. This does not seem to happen, and I can't understand why. Furthermore, in the static case the speedup seems linear except for 16 threads:
Serial: 1020 sec
2 Threads: 511.35 sec
4 Threads: 256.28 sec
8 Threads: 128.66 sec
16 Threads: 90.98 sec
24 Threads: 61.4 sec
Why does the 16-thread case differ so much from the others? I'm running the algorithm on a Google Cloud VM with 24 threads and 96 GB of RAM. Thanks to all.
Why does the 16-thread case differ so much from the others? I'm running the algorithm on a Google Cloud VM with 24 threads and 96 GB of RAM.
As you mentioned in the comments:
It's an Intel Xeon CPU @ 2.30 GHz, 12 physical cores
That is why, when you moved to 16 threads, you stopped scaling (almost) linearly: you are no longer using only physical cores but also logical cores (i.e., hyper-threading).
I expected static to be the best, since the iterations take approximately the same time, while dynamic would introduce too much overhead.
Most of the overhead of dynamic scheduling comes from the locking step threads perform to acquire the next iteration to work on. It looks to me like there is not much locking contention going on, and even if there is, it is compensated by the better load balancing the dynamic scheduler achieves. I have seen this exact pattern before; there is nothing wrong with it.
As a side note, you can transform your code into:
#pragma omp parallel
{
    #pragma omp for
    for (int ii = 0; ii < sizeTest; ii++)
        for (int t = 0; t < sizeTraining; t++)
            distance[ii][t] = EuclideanDistance(testSet[ii], trainingSet[t]);

    #pragma omp for nowait
    for (int ii = 0; ii < sizeTest; ii++) {
        classifyPointFromDistance(trainingSet, &testSet[ii], distance[ii], k, sizeTraining);
    }
}
To avoid having two parallel regions. Moreover, depending on your code you might be able to further optimize it by adding a nowait clause to both #pragma omp for
, which would remove one implicit barrier. Nonetheless, you need to ensure that this will not result in race-conditions.