I have read all the relevant questions I could find, but still have not found a solution to my issue: I have a function with a double for loop that is the bottleneck of my program.
The code is designed around MPI and calls update() in a loop. Now I would like to augment the MPI code with OpenMP to get faster execution, by taking advantage of the double for loop in update().
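For context, here is roughly what the surrounding driver looks like for the hybrid version I am aiming at (a simplified sketch: N, NSTEPS and alloc_grid are placeholders, the halo exchange is omitted, and I am assuming MPI_THREAD_FUNNELED is enough since only the main thread calls MPI):

#include <math.h>
#include <stdlib.h>
#include <mpi.h>

enum { N = 1024, NSTEPS = 1000 };   /* placeholder global grid size and iteration count */

void update(int pSqrt, int id, int subN, float **gridPtr, float **gridNextPtr);  /* defined below */

/* placeholder allocator: a subN x subN block backed by contiguous storage */
static float **alloc_grid(int subN)
{
    float **g = malloc(subN * sizeof *g);
    g[0] = calloc((size_t)subN * subN, sizeof **g);
    for (int r = 1; r < subN; ++r)
        g[r] = g[0] + (size_t)r * subN;
    return g;
}

int main(int argc, char **argv)
{
    int provided, id, nProc;
    /* only the main thread makes MPI calls, so FUNNELED support should be enough */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &nProc);

    int pSqrt = (int)(sqrt((double)nProc) + 0.5);  /* pSqrt x pSqrt process grid */
    int subN  = N / pSqrt + 2;                     /* local block plus one ghost layer on each side */
    float **gridPtr     = alloc_grid(subN);
    float **gridNextPtr = alloc_grid(subN);

    for (int step = 0; step < NSTEPS; ++step) {
        /* halo exchange of the boundary rows/columns with the 4 neighbours goes here */
        update(pSqrt, id, subN, gridPtr, gridNextPtr);
        float **tmp = gridPtr; gridPtr = gridNextPtr; gridNextPtr = tmp;
    }

    MPI_Finalize();
    return 0;
}

The update() function itself is the following: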
void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
    int i = 1, j = 1, end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0) {
        i = 2;
        end_i = subN - 1;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        i = 1;
        end_i = subN - 2;
    }
    #pragma omp parallel for
    for ( ; i < end_i; ++i) {
        if (id % pSqrt == 0) {
            j = 2;
            end_j = subN - 1;
        } else if ((id + 1) % pSqrt == 0) {
            j = 1;
            end_j = subN - 2;
        }
        #pragma omp parallel for
        for ( ; j < end_j; ++j) {
            gridNextPtr[i][j] = gridPtr[i][j] +
                                parms.cx * (gridPtr[i+1][j] +
                                            gridPtr[i-1][j] -
                                            2.0 * gridPtr[i][j]) +
                                parms.cy * (gridPtr[i][j+1] +
                                            gridPtr[i][j-1] -
                                            2.0 * gridPtr[i][j]);
        }
    }
}
I am running this on 2 computers, each of which has 2 CPUs, using 4 MPI processes in total. However, I see no speedup with OpenMP compared to without it. Any ideas, please? I am compiling with the -O1 optimization flag.
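For reference, I launch it roughly like this (Open MPI syntax; the hostfile name is a placeholder, and OMP_NUM_THREADS sets how many OpenMP threads each MPI process spawns):

export OMP_NUM_THREADS=2          # placeholder: OpenMP threads per MPI process
mpirun -np 4 --hostfile hosts ./program   # "hosts" lists the two machines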
What about this version (not tested)?
Please compile it and test it. If it works better, I'll explain it further. BTW, using some more aggressive compiler options might help too (see the compile line sketched after the code).
void update (int pSqrt, int id, int subN, float** gridPtr, float ** gridNextPtr)
{
    int beg_i = 1, beg_j = 1;
    int end_i = subN - 1, end_j = subN - 1;
    if ( id / pSqrt == 0 ) {
        beg_i = 2;
    } else if ( id / pSqrt == (pSqrt - 1) ) {
        end_i = subN - 2;
    }
    if (id % pSqrt == 0) {
        beg_j = 2;
    } else if ((id + 1) % pSqrt == 0) {
        end_j = subN - 2;
    }
    #pragma omp parallel for schedule(static)
    for ( int i = beg_i; i < end_i; ++i ) {
        #pragma omp simd
        for ( int j = beg_j; j < end_j; ++j ) {
            gridNextPtr[i][j] = gridPtr[i][j] +
                                parms.cx * (gridPtr[i+1][j] +
                                            gridPtr[i-1][j] -
                                            2.0 * gridPtr[i][j]) +
                                parms.cy * (gridPtr[i][j+1] +
                                            gridPtr[i][j-1] -
                                            2.0 * gridPtr[i][j]);
        }
    }
}
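Regarding the more aggressive compiler options, a compile line along these lines would be a reasonable starting point (assuming GCC behind an mpicc wrapper; the file name is a placeholder and the exact flags depend on your toolchain):

mpicc -O3 -march=native -fopenmp stencil.c -o stencil

-O3 turns on the auto-vectoriser, -march=native lets the compiler use the SIMD instruction set of your CPUs, and -fopenmp is needed for both the parallel and simd directives to be honoured.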
EDIT: Some explanations of what I did to the code...

1. The original version used nested parallelism (a parallel region nested within another parallel region). This was likely to be very counterproductive, so I simply removed it.
2. The indexes i and j were declared and initialised outside of the for loop statements. This is error-prone at two levels: 1/ it may force you to declare their parallel scope (private), whereas having them within the for statement automatically gives them the right one; and 2/ you can get mix-ups by erroneously reusing the indexes outside of the loops. Moving them into the for statements was easy.
3. You were changing the bounds of the j loop inside the parallel region for no good reason. You would have had to declare end_j private. Moreover, it was a potential limitation for further developments (like the potential use of the collapse(2) directive, sketched below), as it broke the rules for a canonical loop form as defined in the OpenMP standard. So defining beg_i and beg_j outside of the parallel region made sense: it spares computations and simplifies the form of the loops, keeping them canonical.

From there, the code was suitable for vectorisation, and adding a simple simd directive on the j loop enforces it, should the compiler fail to see the possible vectorisation by itself.
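For illustration, here is what the collapse(2) variant mentioned above could look like (untested sketch; it only becomes legal once both loops are in canonical form with loop-invariant bounds, and it replaces the simd directive on the j loop):

#pragma omp parallel for collapse(2) schedule(static)
for ( int i = beg_i; i < end_i; ++i ) {
    for ( int j = beg_j; j < end_j; ++j ) {
        gridNextPtr[i][j] = gridPtr[i][j] +
                            parms.cx * (gridPtr[i+1][j] + gridPtr[i-1][j] - 2.0 * gridPtr[i][j]) +
                            parms.cy * (gridPtr[i][j+1] + gridPtr[i][j-1] - 2.0 * gridPtr[i][j]);
    }
}

Whether this pays off depends on subN and on the number of threads; with only a couple of cores per node, the plain parallel for over i is probably sufficient, and the compiler may still vectorise the inner loop on its own.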