I'm building a program that should multiply two matrices. The requirement is that I do it using OpenMP parallelization, but on just one "for" loop at a time (first the outermost loop, then its child, and then the inner loop) and, for each parallelization method, I should analyze the results when using different numbers of threads (1, 2, 4, 8, 16, 32, 64, 128).
My question is: where should I put the OpenMP parallel/private directives, and which variables should be private/shared to accomplish this?
// code to be parallelized using n_threads
omp_set_dynamic(0); // Explicitly disable dynamic teams
omp_set_num_threads(n_threads);
#pragma omp parallel for shared(a, b, c) private(i,j)
for (i = 0; i < TAM_MATRIZ; i++) {
    for (j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0; // initialize the result matrix with zeros
        for (k = 0; k < TAM_MATRIZ; k++) {
            #pragma omp atomic
            c[i][j] += a[i][k] * b[k][j];
        }
    }
    printf("Number of threads used: %d\n", omp_get_num_threads());
}
EDIT
In fact, it's three programs: the first one parallelizes just the outermost loop, the second one just the middle loop, and the last one parallelizes the inner loop. Each version should run 8 times, using the specified thread counts (1, 2, 4, 8, 16, 32, 64, 128), and then we should compare the performance of the different program versions, and of the same version with different thread counts.
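For the timing part I was planning something along these lines (just a sketch; `multiply` is a placeholder for whichever parallelized version I'm testing, and I'd use `omp_get_wtime` for the measurement):

```c
#include <omp.h>
#include <stdio.h>

void multiply(void);  /* placeholder for the version under test */

int main(void) {
    int thread_counts[] = {1, 2, 4, 8, 16, 32, 64, 128};

    for (int t = 0; t < 8; t++) {
        omp_set_dynamic(0);                     /* disable dynamic teams */
        omp_set_num_threads(thread_counts[t]);  /* request this many threads */

        double start = omp_get_wtime();
        multiply();                             /* the parallelized multiplication */
        double elapsed = omp_get_wtime() - start;

        printf("%3d threads: %f s\n", thread_counts[t], elapsed);
    }
    return 0;
}
```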
My doubt is about which variables to share and which to make private. When parallelizing the first loop, which variables should be shared? And when I'm working on the second loop, which variables are shared? etc.
In my mind, I can't share any variable, because I'll have multiple threads working at the same time and they could produce partial results, but I know I'm wrong, and I'm asking here basically to understand why.
You are on the right track - this is actually quite simple.
You missed `k` in your `private` clause - this would lead to issues, as it is by default `shared` when defined outside. Instead of explicitly choosing the data-sharing attribute for each variable, the best way is to declare variables as locally as possible (e.g. `for (int ...)`); this will almost always be what you want, and it is easier to reason about. `a`, `b`, `c` come from the outside and are implicitly `shared` - the loop variables are declared inside and are implicitly `private`.
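A minimal sketch of what that looks like for the outer-loop and middle-loop versions (assuming the same `a`, `b`, `c` and `TAM_MATRIZ` as in your code; only the placement of the pragma differs):

```c
// Version 1: parallelize the outermost loop.
// Loop variables are declared locally, so they are private automatically;
// a, b, c are defined outside the region and are therefore shared.
#pragma omp parallel for
for (int i = 0; i < TAM_MATRIZ; i++) {
    for (int j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0;
        for (int k = 0; k < TAM_MATRIZ; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

// Version 2: parallelize the middle loop.
// For a fixed i, each thread gets different j values, so the writes to
// c[i][j] still never collide and no atomic is needed.
for (int i = 0; i < TAM_MATRIZ; i++) {
    #pragma omp parallel for
    for (int j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0;
        for (int k = 0; k < TAM_MATRIZ; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
```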
Fortunately there is no need for the `#pragma omp atomic`. Each thread works on a different `i` - so no two threads could ever try to update the same `c[i][j]`. Removing the `atomic` will greatly improve performance. If you ever do need `atomic`, also consider `reduction` as an alternative.
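For instance, the inner-loop version of your assignment is the one case here where all threads really do accumulate into the same element, so a `reduction` is the natural fit. A sketch, assuming the matrices hold doubles:

```c
// Version 3: parallelize the innermost loop.
// All threads contribute to the same c[i][j], so each thread accumulates
// into a private partial sum and OpenMP combines them at the end.
for (int i = 0; i < TAM_MATRIZ; i++) {
    for (int j = 0; j < TAM_MATRIZ; j++) {
        double sum = 0.0;  /* assumes double elements */
        #pragma omp parallel for reduction(+:sum)
        for (int k = 0; k < TAM_MATRIZ; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```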
If you want to print `omp_get_num_threads`, you should do it outside of the loop, but inside the parallel region. In your case this means you have to split `omp parallel for` into an `omp parallel` and an `omp for`. Use `omp single` to make sure only one thread prints.
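A sketch of that split, applied to your outer-loop version:

```c
#pragma omp parallel
{
    // Printed once per run by a single thread, not once per iteration.
    #pragma omp single
    printf("Number of threads used: %d\n", omp_get_num_threads());

    #pragma omp for
    for (int i = 0; i < TAM_MATRIZ; i++) {
        for (int j = 0; j < TAM_MATRIZ; j++) {
            c[i][j] = 0;
            for (int k = 0; k < TAM_MATRIZ; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }
}
```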
Be aware that getting very good performance from a matrix multiplication is much more involved and beyond the scope of this question.
Edit:
For nested loops it is generally better to parallelize the outermost loop if possible - i.e. when no data dependency prevents it. There can be cases where the outermost loop does not yield enough parallelism - in those cases you would rather use `collapse(2)` to parallelize the outer two loops. Do not use (`parallel`) `for` twice unless you absolutely know what you are doing. The reason is that parallelizing the middle loop yields more, smaller pieces of work, which increases the relative overhead.
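A sketch of the `collapse(2)` variant, in case the outer loop alone should ever turn out too small (again assuming the same matrices):

```c
// The i and j loops are fused into a single iteration space of
// TAM_MATRIZ * TAM_MATRIZ chunks, which is then distributed over the threads.
#pragma omp parallel for collapse(2)
for (int i = 0; i < TAM_MATRIZ; i++) {
    for (int j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0;
        for (int k = 0; k < TAM_MATRIZ; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
```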
In your specific case one can safely assume `TAM_MATRIZ >> n_threads`⁰, which means the outermost loop has enough parallel work for all threads to be used efficiently.
To reiterate the data-sharing rules, for a normal `parallel` region:

- Variables defined outside of the `parallel` region are implicitly `shared`.
- Variables defined inside of the `parallel` region (and parallel loop variables) are implicitly `private`. Those are the variables your threads work on.

If a variable is only used within a lexical scope, always¹ define it in the narrowest possible lexical scope. If you follow this, there is almost never a need to explicitly define the `private`/`shared` data-sharing attributes².
⁰ Otherwise it wouldn't even make sense to use OpenMP here.
¹ Exceptions apply for non-trivial C++ types with expensive ctors.
² `reduction` / `firstprivate` are useful to use explicitly.