Tags: c++, multithreading, performance, parallel-processing, openmp

Parallelizing a for-loop and merging thread-private variables


I am a bit confused about how to use OpenMP multithreading to parallelize the for-loop I am working with. In this section of the program I read data from the arrays x and y, defined as:

x = (float*)aligned_alloc(32, sizeof(float) * n);
y = (float*)aligned_alloc(32, sizeof(float) * n);

Here n is an int greater than 0 and divisible by 16, and the goal is to save the largest/smallest x/y values in the vectors maxX, minX, minY and maxY. n can be arbitrarily large; when testing I have used n = 360000000.

The for-loop I am trying to parallelize with multithreading is:

for(int i = 8; i < n; i+= 8 ){
    __m256 vx = _mm256_load_ps(&x[i]);
    __m256 vy = _mm256_load_ps(&y[i]);

    minX = _mm256_min_ps(minX, vx);
    maxX = _mm256_max_ps(maxX, vx);

    minY = _mm256_min_ps(minY, vy);
    maxY = _mm256_max_ps(maxY, vy);

}

Here minX, maxX, minY and maxY are __m256 vectors filled with zeros.
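
A zero-filled __m256 is typically created with _mm256_setzero_ps(); the initialization is not shown above, but presumably looks something like this:

__m256 minX = _mm256_setzero_ps(); // all eight lanes set to 0.0f (assumed; not in the original)
__m256 maxX = _mm256_setzero_ps();
__m256 minY = _mm256_setzero_ps();
__m256 maxY = _mm256_setzero_ps();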

So far my attempts have led me to giving each thread its own private temporary variables for minX, maxX, minY and maxY, which are updated in the for-loop, and then merging the private variables into the shared variables used by the rest of the program, like this:

    #pragma omp parallel num_threads(4)
    {
        __m256 TempMinX = _mm256_load_ps(&x[0]); // creating private variables for each thread
        __m256 TempMaxX = _mm256_load_ps(&x[0]);
        __m256 TempMinY = _mm256_load_ps(&y[0]);
        __m256 TempMaxY = _mm256_load_ps(&y[0]);

        #pragma omp for
        for (int i = 8; i < n; i += 8) {
            __m256 vx = _mm256_load_ps(&x[i]); // loads the values from the x array
            __m256 vy = _mm256_load_ps(&y[i]); // loads the values from the y array

            TempMinX = _mm256_min_ps(TempMinX, vx);
            TempMaxX = _mm256_max_ps(TempMaxX, vx);
            TempMinY = _mm256_min_ps(TempMinY, vy);
            TempMaxY = _mm256_max_ps(TempMaxY, vy);
        }

        /* section to merge the thread-private variables into
           the shared variables, by comparing the values in
           vector minX with the thread's private vector Temp
           and saving the smallest/largest values in the
           shared vector: */

        #pragma omp critical
        minX = _mm256_min_ps(minX, TempMinX);
        #pragma omp critical
        maxX = _mm256_max_ps(maxX, TempMaxX);
        #pragma omp critical
        minY = _mm256_min_ps(minY, TempMinY);
        #pragma omp critical
        maxY = _mm256_max_ps(maxY, TempMaxY);
    }

When running this program and comparing it to the un-parallelized version, the un-parallelized program runs faster than my parallelized one. I suspect this has something to do with the merging section, where each thread has to wait for the others before it can write to the shared variables, but so far I have not found or come up with any good solution to make it run faster...


Solution

  • Most likely the problem is the overhead of the critical regions.

            #pragma omp critical
            minX = _mm256_min_ps(minX, TempMinX);
            #pragma omp critical
            maxX = _mm256_max_ps(maxX, TempMaxX);
            #pragma omp critical
            minY = _mm256_min_ps(minY, TempMinY);
            #pragma omp critical
            maxY = _mm256_max_ps(maxY, TempMaxY);
    

    which can be reduced to a single critical region for the entire block:

            #pragma omp critical
            {
                minX = _mm256_min_ps(minX, TempMinX);
                maxX = _mm256_max_ps(maxX, TempMaxX);
                minY = _mm256_min_ps(minY, TempMinY);
                maxY = _mm256_max_ps(maxY, TempMaxY);
            }
    

    or one can use named critical regions instead:

            #pragma omp critical(region1)
            minX = _mm256_min_ps(minX, TempMinX);
            #pragma omp critical(region2)
            maxX = _mm256_max_ps(maxX, TempMaxX);
            #pragma omp critical(region3)
            minY = _mm256_min_ps(minY, TempMinY);
            #pragma omp critical(region4)
            maxY = _mm256_max_ps(maxY, TempMaxY);
    

    In this manner, multiple threads can concurrently execute different named critical regions.

    Try out both versions to see which one has the least overhead.
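
    A minimal timing sketch for such a comparison, using omp_get_wtime() (the elided part stands for whichever merge variant is being measured):

    #include <omp.h>
    #include <cstdio>

    double t0 = omp_get_wtime();             // wall-clock start
    // ... parallel region with one of the merge variants above ...
    double t1 = omp_get_wtime();             // wall-clock end
    std::printf("elapsed: %.3f s\n", t1 - t0);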

    Another approach, probably the one with the best performance, is to save each thread's private variables into arrays that the master thread picks up outside the parallel region:

    // create arrays of size equal to the number of threads
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for
        for (int i = 8; i < n; i += 8) {
             ...
        }
        int threadID = omp_get_thread_num();
        array_minX[threadID] = TempMinX;
        array_maxX[threadID] = TempMaxX;
        array_minY[threadID] = TempMinY;
        array_maxY[threadID] = TempMaxY;
     }
     // the master thread then calculates the _mm256_min_ps over the array array_minX, and so on.
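
    Put together, a complete sketch of this approach might look as follows (the array names mirror the snippet above; the final merge loop is this sketch's addition):

    #include <immintrin.h>
    #include <omp.h>

    // One slot per thread; __m256 arrays are naturally 32-byte aligned.
    __m256 array_minX[4], array_maxX[4], array_minY[4], array_maxY[4];

    #pragma omp parallel num_threads(4)
    {
        // Per-thread accumulators, seeded with the first 8 elements.
        __m256 TempMinX = _mm256_load_ps(&x[0]);
        __m256 TempMaxX = _mm256_load_ps(&x[0]);
        __m256 TempMinY = _mm256_load_ps(&y[0]);
        __m256 TempMaxY = _mm256_load_ps(&y[0]);

        #pragma omp for
        for (int i = 8; i < n; i += 8) {
            __m256 vx = _mm256_load_ps(&x[i]);
            __m256 vy = _mm256_load_ps(&y[i]);
            TempMinX = _mm256_min_ps(TempMinX, vx);
            TempMaxX = _mm256_max_ps(TempMaxX, vx);
            TempMinY = _mm256_min_ps(TempMinY, vy);
            TempMaxY = _mm256_max_ps(TempMaxY, vy);
        }

        // Each thread writes its partial result to its own slot:
        // no sharing, so no critical section is needed.
        int threadID = omp_get_thread_num();
        array_minX[threadID] = TempMinX;
        array_maxX[threadID] = TempMaxX;
        array_minY[threadID] = TempMinY;
        array_maxY[threadID] = TempMaxY;
    } // implicit barrier: all slots are written before the merge below

    // Single-threaded merge of the per-thread partial results.
    minX = array_minX[0]; maxX = array_maxX[0];
    minY = array_minY[0]; maxY = array_maxY[0];
    for (int t = 1; t < 4; ++t) {
        minX = _mm256_min_ps(minX, array_minX[t]);
        maxX = _mm256_max_ps(maxX, array_maxX[t]);
        minY = _mm256_min_ps(minY, array_minY[t]);
        maxY = _mm256_max_ps(maxY, array_maxY[t]);
    }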
     
    

    Lastly, you can use OpenMP 4.0 user-defined reductions to create your own reduction; this is basically what the previous approach does by hand, without using OpenMP's built-in features.
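
    A sketch of what such a user-defined reduction could look like here (the vmin/vmax reduction identifiers are made up for this example; requires OpenMP 4.0+):

    #include <cfloat>       // FLT_MAX
    #include <immintrin.h>
    #include <omp.h>

    // Custom reductions over __m256: combine partial results with the
    // vector min/max, and initialize each thread's private copy with
    // the identity value (+FLT_MAX for min, -FLT_MAX for max).
    #pragma omp declare reduction(vmin : __m256 : \
            omp_out = _mm256_min_ps(omp_out, omp_in)) \
            initializer(omp_priv = _mm256_set1_ps(FLT_MAX))
    #pragma omp declare reduction(vmax : __m256 : \
            omp_out = _mm256_max_ps(omp_out, omp_in)) \
            initializer(omp_priv = _mm256_set1_ps(-FLT_MAX))

    #pragma omp parallel for num_threads(4) \
            reduction(vmin : minX) reduction(vmax : maxX) \
            reduction(vmin : minY) reduction(vmax : maxY)
    for (int i = 8; i < n; i += 8) {
        __m256 vx = _mm256_load_ps(&x[i]);
        __m256 vy = _mm256_load_ps(&y[i]);
        minX = _mm256_min_ps(minX, vx);
        maxX = _mm256_max_ps(maxX, vx);
        minY = _mm256_min_ps(minY, vy);
        maxY = _mm256_max_ps(maxY, vy);
    }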