c++multithreading performance parallel-processing openmp

Parallelizing a 1D matrix multiplication using OpenMP

I'm new to OpenMP and I'm trying to parallelize a 1D matrix multiplication (only multiplying the upper triangle of the matrix). My solution now is fast, but it's currently only using approximately half of the available threads, so I'm supposed to further parallelize it but I don't know where to start (I know saving data in the result array is the most time consuming, this array is external and can only save data into it).

This is the code (I'm multiplying and transposing at once):

#pragma omp parallel
{
#pragma omp for
for (int j=0; j < ny; ++j){
    for (int i=0; i < ny; ++i){
        if (i < j){
            continue;
        }
        double mat_mul_sum{0.};
        for (int k=0; k < nx; ++k){
             double mat_mul = new_data[k + j*nx] * new_data[k + i*nx];
             mat_mul_sum += mat_mul;
        }
        array_results[i + j*ny] = mat_mul_sum;
    }
}
}

Help is greatly appreciated.

Solution

First of all, you can start by removing the continue in the loop and start the i-based loop at the value j. Then, you can focus on the load-balancing issue between the threads.

One simple solution to balance the work between threads is to work simultaneously on the beginning and the end of the matrix (in each thread). Visually, it is like cutting a corner of the triangle and moving it to form a rectangle of size ny x (ny/2). Note that the case where ny is odd should be carefully considered (as the shape will not be a pure rectangle).

Finally, you can optimize the k-loop using an SIMD reduction.

Here is what the resulting code should look like (untested):

#pragma omp parallel
{
    #pragma omp for
    for (int j=0; j < (ny+1)/2; ++j)
    {
        const int jUp = j;
        const int jDown = ny - 1 - j;

        for (int i=jUp; i < ny; ++i)
        {
            double mat_mul_sum{0.};
            #pragma omp simd reduction(+:mat_mul_sum)
            for (int k=0; k < nx; ++k){
                 double mat_mul = new_data[k + jUp*nx] * new_data[k + i*nx];
                 mat_mul_sum += mat_mul;
            }
            array_results[i + jUp*ny] = mat_mul_sum;
        }

        // Compute the center only once
        if(jUp != jDown)
        {
            for (int i=jDown; i < ny; ++i)
            {
                double mat_mul_sum{0.};
                #pragma omp simd reduction(+:mat_mul_sum)
                for (int k=0; k < nx; ++k){
                     double mat_mul = new_data[k + jDown*nx] * new_data[k + i*nx];
                     mat_mul_sum += mat_mul;
                }
                array_results[i + jDown*ny] = mat_mul_sum;
            }
        }
    }
}