Tags: c, parallel-processing, openmp

Wrong result with OpenMP to parallelize GEMM


I know OpenMP shares all variables declared in an outer scope among all workers, and that may be the answer to my question. But I am really confused about why function omp3 delivers the right result while function omp2 delivers a wrong one.

void omp2(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
#pragma omp parallel for
        for (int ki = 0; ki < k; ++ki) {
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}
void omp3(double *A, double *B, double *C, int m, int k, int n) {
    for (int i = 0; i < m; ++i) {
        for (int ki = 0; ki < k; ++ki) {
#pragma omp parallel for
            for (int j = 0; j < n; ++j) {
                C[i * n + j] += A[i * k + ki] * B[ki * n + j];
            }
        }
    }
}

Solution

  • The problem is a race condition on this line:

    C[i * n + j] += ...
    

    In omp2 the ki loop is parallelized, so threads working on different values of ki read and write the same memory location (C[i * n + j]) simultaneously; this concurrent read-modify-write is a data race. In omp3 the j loop is parallelized instead, so each thread updates a distinct set of elements of C and no race occurs.

    The solution, as suggested by @Victor Eijkhout, is to reorder your loops and use a local variable to accumulate the sum of the innermost loop. That way C[i * n + j] is updated only once, so the data race is gone and the outermost loop can be parallelized (which gives the best performance):

    #pragma omp parallel for
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0;
            for (int ki = 0; ki < k; ++ki) {
                sum += A[i * k + ki] * B[ki * n + j];
            }
            C[i * n + j] += sum;
        }
    }
    
    

    Note that you can add a collapse(2) clause, which may further increase performance.
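    As a sketch, the collapse(2) variant fuses the i and j loops into one m*n iteration space, which gives OpenMP more parallelism to distribute when m alone is small (the function name `gemm_collapse` and the driver values are mine, for illustration):

    ```c
    #include <assert.h>

    /* Reordered GEMM with collapse(2): the i and j loops are fused
     * into a single parallel iteration space. Each C element is still
     * written by exactly one thread, so there is no data race. */
    void gemm_collapse(const double *A, const double *B, double *C,
                       int m, int k, int n) {
    #pragma omp parallel for collapse(2)
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                double sum = 0.0;
                for (int ki = 0; ki < k; ++ki)
                    sum += A[i * k + ki] * B[ki * n + j];
                C[i * n + j] += sum;  /* single update per element */
            }
        }
    }

    int main(void) {
        /* Tiny smoke test: A (2x3) and B (3x2) all ones, so every
         * entry of C = A * B equals k = 3. */
        enum { M = 2, K = 3, N = 2 };
        double A[M * K], B[K * N], C[M * N] = {0};
        for (int i = 0; i < M * K; ++i) A[i] = 1.0;
        for (int i = 0; i < K * N; ++i) B[i] = 1.0;
        gemm_collapse(A, B, C, M, K, N);
        for (int i = 0; i < M * N; ++i) assert(C[i] == 3.0);
        return 0;
    }
    ```

    Note that collapse(2) requires the two fused loops to be perfectly nested, which the reordered version satisfies; without -fopenmp the pragma is simply ignored and the code runs serially with the same result.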