Tags: c, performance, parallel-processing, openmp, matrix-multiplication

Matrix Multiplication using OpenMP (C) - Collapsing all the loops


So I was learning the basics of OpenMP in C and its work-sharing constructs, particularly the for loop. One of the most famous examples used in all the tutorials is matrix multiplication, but all of them just parallelize the outer loop or the two outer loops. I was wondering why we do not parallelize and collapse all 3 loops (using atomic), as I have done here:

    //Assuming the matrices are declared elsewhere, e.g. int A[100][100], B[100][100], C[100][100];
    for(int i=0;i<100;i++){
        //Initialize the arrays
        for(int j=0;j<100;j++){
            A[i][j] = i;
            B[i][j] = j;
            C[i][j] = 0;       
        }       
    }

    //Starting the matrix multiplication
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for collapse(3)
        for(int i=0;i<100;i++){
            for(int j=0;j<100;j++){
                for(int k=0;k<100;k++){
                        #pragma omp atomic
                        C[i][j] = C[i][j]+ (A[i][k]*B[k][j]);
                }       
            }       
        }   
    }

Can you tell me what I am missing here, or why this is an inferior/superior solution?


Solution

  • Atomic operations are very costly on most architectures compared to non-atomic ones (see here to understand why, or here for a more detailed analysis). This is especially true when many threads make concurrent accesses to the same shared memory area. To put it simply, one cause is that threads performing atomic operations cannot fully run in parallel without waiting for each other on most hardware, due to the implicit synchronizations and communications required by the cache coherence protocol. Another source of slowdowns is the high latency of atomic operations (again due to the cache hierarchy).
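
    To get a feel for this cost, here is a minimal sketch (not part of the original answer) that times a shared accumulation performed with atomic updates against the same accumulation performed with an OpenMP reduction; the exact numbers are machine-dependent, but the atomic version is typically far slower:

        #include <omp.h>
        #include <stdio.h>

        int main(void){
            const long n = 10000000;
            long sum_atomic = 0, sum_reduction = 0;

            // Every iteration issues an atomic update on the same shared variable,
            // forcing the cache line to bounce between the cores.
            double t0 = omp_get_wtime();
            #pragma omp parallel for num_threads(4)
            for(long i = 0; i < n; i++){
                #pragma omp atomic
                sum_atomic += 1;
            }
            double t_atomic = omp_get_wtime() - t0;

            // Each thread accumulates privately; partial sums are combined once at the end.
            t0 = omp_get_wtime();
            #pragma omp parallel for num_threads(4) reduction(+:sum_reduction)
            for(long i = 0; i < n; i++){
                sum_reduction += 1;
            }
            double t_reduction = omp_get_wtime() - t0;

            printf("atomic: %f s, reduction: %f s (sums: %ld, %ld)\n",
                   t_atomic, t_reduction, sum_atomic, sum_reduction);
            return 0;
        }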

    If you want to write code that scales well, you need to minimize synchronizations and communications (including atomic operations). As a result, using collapse(2) is much better than collapse(3). But this is not the only issue in your code. Indeed, to be efficient you must access memory contiguously and keep data in caches as much as possible.
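
    For reference, here is what the collapse(2) version can look like (a sketch, not part of the original answer, assuming int matrices as in the question's initialization): each (i, j) pair is handled by exactly one thread, so the inner k loop can accumulate into a local variable without any atomic operation:

        #pragma omp parallel num_threads(4)
        {
            #pragma omp for collapse(2)
            for(int i=0;i<100;i++){
                for(int j=0;j<100;j++){
                    // C[i][j] is written by a single thread, so no atomic is needed.
                    int sum = 0;
                    for(int k=0;k<100;k++){
                        sum += A[i][k]*B[k][j];
                    }
                    C[i][j] = sum;
                }
            }
        }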

    For example, swapping the loop iterating over j and the one iterating over k (which does not work with collapse(2)) is several times faster on most systems due to more contiguous memory accesses (about 8 times faster on my PC):

        for(int i=0;i<100;i++){
            //Initialize the arrays
            for(int j=0;j<100;j++){
                A[i][j] = i;
                B[i][j] = j;
                C[i][j] = 0;       
            }       
        }
    
        //Starting the matrix multiplication
        #pragma omp parallel num_threads(4)
        {
            #pragma omp for
            for(int i=0;i<100;i++){
                for(int k=0;k<100;k++){
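                    // With the i-k-j order, the innermost loop reads B[k][j] and
                    // updates C[i][j] along contiguous rows, which is cache-friendly.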
                    for(int j=0;j<100;j++){
                        C[i][j] = C[i][j] + (A[i][k]*B[k][j]);
                    }
                }
            }
        }
    

    Writing fast matrix-multiplication code is not easy. Consider using BLAS libraries such as OpenBLAS, ATLAS, Eigen or Intel MKL rather than writing your own code if your goal is to use this in production. Indeed, such libraries are very optimized and often scale well on many cores. If your goal is to understand how to write efficient matrix-multiplication code, a good starting point may be to read this tutorial.
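
    As a rough illustration (not part of the original answer), here is what a call through the CBLAS interface, as provided for example by OpenBLAS, can look like, assuming double-precision matrices stored in row-major order and linking with -lopenblas (or the equivalent flag for your BLAS):

        #include <cblas.h>
        #include <stdio.h>

        #define N 100

        int main(void){
            static double A[N][N], B[N][N], C[N][N];

            // Same initialization as in the question, but with doubles.
            for(int i = 0; i < N; i++){
                for(int j = 0; j < N; j++){
                    A[i][j] = i;
                    B[i][j] = j;
                    C[i][j] = 0.0;
                }
            }

            // Computes C = 1.0 * A * B + 0.0 * C (row-major, no transposition).
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N,
                        1.0, &A[0][0], N,
                        &B[0][0], N,
                        0.0, &C[0][0], N);

            printf("C[1][2] = %f\n", C[1][2]);
            return 0;
        }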