Tags: c, multithreading, performance, parallel-processing, openmp

Optimizing a matrix transpose function with OpenMP


I have this code that transposes a matrix using a loop-tiling strategy.

void transposer(int n, int m, double *dst, const double *src) {
   int blocksize = 32;  // tile size (example value; assumed to evenly divide n and m)
   for (int i = 0; i < n; i += blocksize) {
      for (int j = 0; j < m; j += blocksize) {
          // transpose the block beginning at [i,j]
          for (int k = i; k < i + blocksize; ++k) {
              for (int l = j; l < j + blocksize; ++l) {
                  dst[k + l*n] = src[l + k*m];
              }
          }
      }
   }
}

I want to optimize this with multi-threading using OpenMP, but I am not sure what to do when there are so many nested for loops. I thought about just adding #pragma omp parallel for, but doesn't that only parallelize the outermost loop?
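To make that concrete, the placement I had in mind looks roughly like this (with blocksize fixed to some constant, say 32, just for the example). As I understand it, this would only distribute the iterations of the i loop across threads:

// Naive attempt: only the outermost (block-row) loop is parallelized.
// Compile with an OpenMP flag such as -fopenmp.
#pragma omp parallel for
for (int i = 0; i < n; i += blocksize) {
   for (int j = 0; j < m; j += blocksize) {
      for (int k = i; k < i + blocksize; ++k) {
         for (int l = j; l < j + blocksize; ++l) {
            dst[k + l*n] = src[l + k*m];
         }
      }
   }
}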


Solution

  • When you try to parallelize a loop nest, you should ask yourself how many levels are conflict-free, meaning that every iteration writes to a different location. If two iterations can write to the same location, you need to either (1) use a reduction, (2) use a critical section or other synchronization, (3) decide that this loop is not worth parallelizing, or (4) rewrite your algorithm.

    In your case, the write location depends on k and l. The destination index is k + l*n, and since k < n, two distinct pairs (k,l) and (k',l') can never map to the same location. Furthermore, no two iterations of the inner loops use the same (k,l) pair. So all four loops are parallel, and they are perfectly nested, so you can use collapse(4); see the sketch after this answer.

    You could also have drawn this conclusion by considering the algorithm in the abstract: in a matrix transposition each target location is written exactly once, so no matter how you traverse the target data structure, it's completely parallel.
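Below is a minimal sketch of that parallelization, keeping the original indexing and assuming, as in the question's code, that n and m are multiples of blocksize (the function name transposer_omp and the value 32 are mine, chosen only for illustration). Collapsing all four loops is only valid on compilers that implement OpenMP 5.0's support for non-rectangular loop nests, because the bounds of the k and l loops depend on i and j; the portable variant shown here collapses just the two rectangular outer loops with collapse(2).

void transposer_omp(int n, int m, double *dst, const double *src) {
   const int blocksize = 32;  // tile size; assumed to evenly divide n and m
   // Every (k,l) pair writes a distinct dst element, so threads never
   // conflict and no synchronization is needed.
   // Compile with an OpenMP flag such as -fopenmp.
   #pragma omp parallel for collapse(2)
   for (int i = 0; i < n; i += blocksize) {
      for (int j = 0; j < m; j += blocksize) {
         // transpose the block beginning at [i,j]
         for (int k = i; k < i + blocksize; ++k) {
            for (int l = j; l < j + blocksize; ++l) {
               dst[k + l*n] = src[l + k*m];
            }
         }
      }
   }
}

On an OpenMP 5.0 compiler you can change the clause to collapse(4) as described above; collapse(2) already yields (n/blocksize)*(m/blocksize) independent chunks of work, which is usually more than enough to keep all cores busy.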