I parallelized my code like this:
for (int i=0; i<size; ++i) {
    #pragma omp parallel for
    for (int j=i; j<size; ++j) {
        int l = j+1;
        float sum  = a[i*size+j];
        float sum2 = a[l*size+i];
        for (int k=0; k<i; ++k) {
            sum  -= a[i*size+k] * a[k*size+j];
            sum2 -= a[l*size+k] * a[k*size+i];
        }
        a[i*size+j] = sum;
        a[l*size+i] = sum2;
    }
    #pragma omp parallel for
    for (int j=i+1; j<size; ++j) {
        a[j*size+i] /= a[i*size+i];
    }
}
But I want it to be like this:
for (int i=0; i<size; ++i) {
    #pragma omp parallel for
    for (int j=i; j<size; ++j) {
        int l = j+1;
        float sum  = a[i*size+j];
        float sum2 = a[l*size+i];
        for (int k=0; k<i; ++k) {
            sum  -= a[i*size+k] * a[k*size+j];
            sum2 -= a[l*size+k] * a[k*size+i];
        }
        a[i*size+j] = sum;
        a[l*size+i] = sum2;
        a[l*size+i] /= a[i*size+i];
    }
}
so that I can get better performance. However, if I put a[l*size+i] /= a[i*size+i];
into the same loop as the rest, I get a different result than I should. I suspect it is because of the OpenMP directives, since without them both versions produce the same result.
I would appreciate some tips on how to make this work, or on how to improve the performance in general.
The merged version gives a different result because of a data race: the j == i iteration of the parallel loop writes the pivot a[i*size+i], while the other iterations read it for the division, so some threads may divide by a stale pivot. Keeping the division in a separate loop works because the implicit barrier at the end of the first worksharing loop guarantees that the pivot (and the sum2 values) are finished before any division starts. Without redesigning the code you can try something like:
#pragma omp parallel   // one parallel region for the whole factorization
{
    for (int i=0; i<size; ++i)   // every thread executes the i loop
    {
        #pragma omp for          // worksharing loop; implicit barrier at the end
        for (int j=i; j<size; ++j) {
            int l = j+1;
            float sum  = a[i*size+j];
            float sum2 = a[l*size+i];
            for (int k=0; k<i; ++k) {
                sum  -= a[i*size+k] * a[k*size+j];
                sum2 -= a[l*size+k] * a[k*size+i];
            }
            a[i*size+j] = sum;
            a[l*size+i] = sum2;
        }
        #pragma omp for          // the barrier above guarantees the pivot is final
        for (int j=i+1; j<size; ++j)
            a[j*size+i] /= a[i*size+i];
    }
}
Instead of creating the parallel region twice per iteration of the i loop (a total of 2 * size parallel regions), you create a single one. Nonetheless, in an efficient implementation of the OpenMP standard a new parallel region does not introduce as much overhead as one might think, because the threads are typically created the first time and reused in subsequent parallel regions.
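If you want to see how large that overhead actually is on your machine, a rough micro-benchmark along the lines below can help. This is only a sketch of mine, not part of the original code: REPS, N and x are arbitrary placeholders, and it has to be compiled with OpenMP enabled (e.g. gcc -fopenmp).

#include <omp.h>
#include <stdio.h>

int main(void)
{
    enum { REPS = 10000, N = 1000 };
    static float x[N];               // placeholder data, just to give the loops work

    double t0 = omp_get_wtime();
    for (int r = 0; r < REPS; ++r) {
        #pragma omp parallel for     // a new parallel region on every iteration
        for (int i = 0; i < N; ++i)
            x[i] += 1.0f;
    }
    double t1 = omp_get_wtime();

    #pragma omp parallel             // one region containing REPS worksharing loops
    {
        for (int r = 0; r < REPS; ++r) {
            #pragma omp for          // implicit barrier after each loop
            for (int i = 0; i < N; ++i)
                x[i] += 1.0f;
        }
    }
    double t2 = omp_get_wtime();

    printf("many regions: %f s, single region: %f s\n", t1 - t0, t2 - t1);
    return 0;
}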
Still, one of the overheads of having multiple parallel regions is the implicit barrier at the end of each of them. Unfortunately, that overhead is still present in the version I am presenting, since every #pragma omp for also ends with an implicit barrier. To avoid it you would need to redesign the algorithm.
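When two worksharing loops inside the same parallel region touch independent data, that barrier can be removed with the nowait clause. That is not safe here, because the divisions need the pivot and the column values produced by the first loop, but as a generic illustration (the function and array names below are my own, not from the code above):

void scale_independent(float *x, float *y, int n, float c)
{
    // x and y are assumed not to alias, so the two loops are independent
    #pragma omp parallel
    {
        #pragma omp for nowait   // no barrier: the second loop never reads x
        for (int i = 0; i < n; ++i)
            x[i] *= c;

        #pragma omp for          // the parallel region still ends with a barrier
        for (int i = 0; i < n; ++i)
            y[i] *= c;
    }
}

Dropping the barrier in the factorization above would reintroduce exactly the race of the merged loop: some threads could start dividing before the pivot is written.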