I'm building a program that should multiply two matrices. The requirement is that I do it using OpenMP parallelization, but on just one "for" loop at a time (first the outermost loop, then its child, and then the inner loop) and, for each parallelization method, I should analyze the results when using different numbers of threads (1, 2, 4, 8, 16, 32, 64, 128).
My question is: where should I put the OpenMP parallel/private directives, and which variables should be private/shared to accomplish this?
// code to be parallelized using n_threads
omp_set_dynamic(0); // Explicitly disable dynamic teams
omp_set_num_threads(n_threads);
#pragma omp parallel for shared(a, b, c) private(i,j)
for (i = 0; i < TAM_MATRIZ; i++) {
    for (j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0; // initialize the result matrix with zeros
        for (k = 0; k < TAM_MATRIZ; k++) {
            #pragma omp atomic
            c[i][j] += a[i][k] * b[k][j];
        }
    }
    printf("Number of threads used: %d\n", omp_get_num_threads());
}
EDIT
In fact, it's three programs: the first one parallelizes just the outermost loop, the second one just the middle loop, and the last one parallelizes the inner loop. Each version should run 8 times, using the specified thread counts (1, 2, 4, 8, 16, 32, 64, 128), and then we should compare the performance of the different program versions, and of the same version with different thread counts.
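For the timing part I was planning something along these lines (just a sketch; `multiply` is a placeholder for whichever parallelized version I'm testing, and I'd use `omp_get_wtime` for the measurement):

```c
#include <omp.h>
#include <stdio.h>

void multiply(void);  /* placeholder for the version under test */

int main(void) {
    int thread_counts[] = {1, 2, 4, 8, 16, 32, 64, 128};

    for (int t = 0; t < 8; t++) {
        omp_set_dynamic(0);                     /* disable dynamic teams */
        omp_set_num_threads(thread_counts[t]);  /* request this many threads */

        double start = omp_get_wtime();
        multiply();                             /* the parallelized multiplication */
        double elapsed = omp_get_wtime() - start;

        printf("%3d threads: %f s\n", thread_counts[t], elapsed);
    }
    return 0;
}
```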
My doubt is about which variables to share and which to make private. When parallelizing the first loop, which variables should be shared? And when I'm working on the second loop, which variables are shared? etc.
In my mind, I can't share any variable, because I'll have multiple threads working at the same time and they could produce partial results, but I know I'm wrong, and I'm asking here basically to understand why.
You are on the right track - this is actually quite simple.
You missed `k` in your `private` clause - this would lead to issues, as it is by default `shared` when defined outside. Instead of explicitly choosing the data-sharing attribute for each variable, the best way is to declare variables as locally as possible (e.g. `for (int ...)`); this will almost always be what you want, and it is easier to reason about. `a`, `b`, `c` come from the outside and are implicitly `shared` - the loop variables are declared inside and are implicitly `private`.
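A minimal sketch of what that looks like for the outer-loop and middle-loop versions (assuming the same `a`, `b`, `c` and `TAM_MATRIZ` as in your code; only the placement of the pragma differs):

```c
// Version 1: parallelize the outermost loop.
// Loop variables are declared locally, so they are private automatically;
// a, b, c are defined outside the region and are therefore shared.
#pragma omp parallel for
for (int i = 0; i < TAM_MATRIZ; i++) {
    for (int j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0;
        for (int k = 0; k < TAM_MATRIZ; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

// Version 2: parallelize the middle loop.
// For a fixed i, each thread gets different j values, so the writes to
// c[i][j] still never collide and no atomic is needed.
for (int i = 0; i < TAM_MATRIZ; i++) {
    #pragma omp parallel for
    for (int j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0;
        for (int k = 0; k < TAM_MATRIZ; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
```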
Fortunately there is no need for the `#pragma omp atomic`. Each thread works on a different `i` - so no two threads could ever try to update the same `c[i][j]`. Removing the `atomic` will greatly improve performance. If you ever do need `atomic`, also consider `reduction` as an alternative.
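For instance, the inner-loop version of your assignment is the one case here where all threads really do accumulate into the same element, so a `reduction` is the natural fit. A sketch, assuming the matrices hold doubles:

```c
// Version 3: parallelize the innermost loop.
// All threads contribute to the same c[i][j], so each thread accumulates
// into a private partial sum and OpenMP combines them at the end.
for (int i = 0; i < TAM_MATRIZ; i++) {
    for (int j = 0; j < TAM_MATRIZ; j++) {
        double sum = 0.0;  /* assumes double elements */
        #pragma omp parallel for reduction(+:sum)
        for (int k = 0; k < TAM_MATRIZ; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
```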
If you want to print `omp_get_num_threads`, you should do it outside of the loop, but inside the parallel region. In your case this means you have to split `omp parallel for` into an `omp parallel` and an `omp for`. Use `omp single` to make sure only one thread prints.
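A sketch of that split, applied to your outer-loop version:

```c
#pragma omp parallel
{
    // Printed once per run by a single thread, not once per iteration.
    #pragma omp single
    printf("Number of threads used: %d\n", omp_get_num_threads());

    #pragma omp for
    for (int i = 0; i < TAM_MATRIZ; i++) {
        for (int j = 0; j < TAM_MATRIZ; j++) {
            c[i][j] = 0;
            for (int k = 0; k < TAM_MATRIZ; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    }
}
```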
Be aware that getting very good performance from a matrix multiplication is much more involved and beyond the scope of this question.
Edit:
For nested loops it is generally better to parallelize the outermost loop if possible - i.e. when no data dependency prevents it. There can be cases where the outermost loop does not yield enough parallelism - in those cases you would rather use `collapse(2)` to parallelize the outer two loops. Do not use (`parallel`) `for` twice unless you absolutely know what you are doing. The reason is that parallelizing the middle loop yields more, smaller pieces of work, which increases the relative overhead.
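A sketch of the `collapse(2)` variant, in case the outer loop alone should ever turn out too small (again assuming the same matrices):

```c
// The i and j loops are fused into a single iteration space of
// TAM_MATRIZ * TAM_MATRIZ chunks, which is then distributed over the threads.
#pragma omp parallel for collapse(2)
for (int i = 0; i < TAM_MATRIZ; i++) {
    for (int j = 0; j < TAM_MATRIZ; j++) {
        c[i][j] = 0;
        for (int k = 0; k < TAM_MATRIZ; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}
```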
In your specific case one can safely assume `TAM_MATRIZ >> n_threads`⁰, which means the outermost loop has enough parallel work for all threads to be used efficiently.
To reiterate the data-sharing rules, for a normal `parallel` region:

- Variables defined outside of the `parallel` region are implicitly `shared`.
- Variables defined inside of the `parallel` region (and parallel loop variables) are implicitly `private`. Those are the variables your threads work on.

If a variable is only used within a lexical scope, always¹ define it in the narrowest possible lexical scope. If you follow this, there is almost never a need to explicitly define the `private`/`shared` data-sharing attributes².
⁰ Otherwise it wouldn't even make sense to use OpenMP here.
¹ Exceptions apply for non-trivial C++ types with expensive ctors.
² `reduction` / `firstprivate` are useful to use explicitly.