I'm working on LU decomposition of block diagonal matrices using OpenACC.
I get the correct decomposition when I run my code sequentially, whilst when executing it under OpecACC directives I get wrong result when conducting the decomposition.
LU decomposition involves nested loop of that type (see here LUPSolve
function):
for (unsigned int i = 0; i < N; i++)
for (unsigned int k = 0; k < i; k++)
It appears that when this type of nested loop is used in a routine seq
directive within a parallel region, the device always manages to enter the nested loop even when i=0
(which souldn't be possible because of the k<i
condition).
I made a simple code to check it:
#pragma acc routine seq
void test ( int* x, int const n ) {
for (unsigned int i = 0; i < n; i++) {
x[i] = -1;
for (unsigned int k = 0; k < i; k++)
x[i] = k < i;
}
}
int main ( ) {
unsigned const n(4);
unsigned const nb(3);
int x[nb*n];
#pragma acc parallel loop copyout(x[:nb*n])
for (unsigned int b = 0; b < nb; b++)
test(x+b*n,n);
// display x
}
The result I get is this one:
x = 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1,
But the correct one (which I get when I run the code without OpenACC) should be:
x = -1, 1, 1, 1, -1, 1, 1, 1, -1, 1, 1, 1,
I must be doing something wrong because it shouldn't enter the nested loop when i=0
...
Plus when I put the loops directly in the parallel region (without using a function call) it does work fine.
Looks like a compiler code generator issue where it's always executing the inner loop even when k and i are both zero. I've filed a problem report (TPR#24317) and sent it to our compiler engineers for further evaluation. As a work-around, add an "if" check in the inner loop.
% cat test.cpp
#include <stdio.h>
#include <stdlib.h>
#pragma acc routine seq
void test ( int* x, int const n ) {
for (unsigned int i = 0; i < n; i++) {
x[i] = -1;
for (unsigned int k = 0; k < i; k++) {
if (k < i)
x[i] = (k<i);
}
}
}
int main ( ) {
unsigned const n(4);
unsigned const nb(3);
int x[nb*n];
#pragma acc parallel loop copyout(x[:nb*n])
for (unsigned int b = 0; b < nb; b++)
test(x+b*n,n);
for (int i=0; i <nb; ++i) {
for (int j=0; j <n; ++j) {
printf("%d:%d %d\n", i,j, x[i*n+j]);
} }
exit(0);
}
% pgc++ -acc -Minfo=acc -ta=tesla:cc60 test.cpp; a.out
test(int *, int):
5, Generating acc routine seq
Generating Tesla code
main:
18, Generating copyout(x[:])
Accelerator kernel generated
Generating Tesla code
20, #pragma acc loop gang, vector(3) /* blockIdx.x threadIdx.x */
0:0 -1
0:1 1
0:2 1
0:3 1
1:0 -1
1:1 1
1:2 1
1:3 1
2:0 -1
2:1 1
2:2 1
2:3 1