Tags: optimization, blocking, openacc

OpenACC and cache tiling


----- example code -----------

for (body1 = 0; body1 < NBODIES; body1++) {
   for (body2 = 0; body2 < NBODIES; body2++) {
      OUT[body1] += compute(body1, body2);
   }
}

----- blocking code------

for (body2 = 0; body2 < NBODIES; body2 += BLOCK) {
   for (body1 = 0; body1 < NBODIES; body1++) {
      for (body22 = 0; body22 < BLOCK; body22++) {
         OUT[body1] += compute(body1, body2 + body22);
      }
   }
}

I inserted OpenACC directives to offload this code to the GPU, but performance decreased. I searched some papers, and they conclude the reason is that OpenACC cannot take advantage of GPU shared memory. But I think the main reason is that the tiling/blocking prevents parallelism, because the tiling introduces a data dependence. Does OpenACC not support, or not encourage, code tiling? Is there a solution or example where tiling improves OpenACC code?


Solution

  • OpenACC can do automatic and explicit tiling (via the tile clause); however, I don't think this is your issue. The problem I see is that the body2 loop is not parallelizable due to the dependency on "OUT[body1]". OpenACC can perform scalar reductions in parallel, so you might try the following:

      #pragma acc parallel loop 
      for (body1 = 0; body1 < NBODIES; body1++) {
        float sum = 0.0f;   /* declared inside the loop so each iteration gets a private copy */
      #pragma acc loop reduction(+:sum)
        for (body2 = 0; body2 < NBODIES; body2++) {
          sum += compute(body1, body2);
        }
        OUT[body1] += sum;
      }
    

    Granted, I'm guessing here, so if this does not help, please post a compilable example of the issue. If you are using PGI, please post the compiler feedback messages (-Minfo=accel) as well.
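
    As for the tile clause itself: it is meant for loop nests whose iterations are independent, so you let the compiler do the strip-mining rather than blocking by hand (hand-blocking often defeats the compiler's dependence analysis). A minimal sketch, with a made-up independent nest since your compute() routine isn't shown; the array names and sizes here are illustrative, not from your code:

    ```c
    #include <stdio.h>

    #define N 256

    float A[N][N], B[N][N];

    int main(void) {
        /* Fill B with some known values. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                B[i][j] = (float)(i + j);

        /* tile(32,32) asks the compiler to strip-mine this tightly nested,
           dependence-free i/j nest into 32x32 tiles. Without an OpenACC
           compiler the pragma is simply ignored and the code runs serially. */
        #pragma acc parallel loop tile(32,32)
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                A[i][j] = 2.0f * B[i][j];
            }
        }

        printf("A[1][2] = %g\n", A[1][2]);
        return 0;
    }
    ```

    Note that tile requires the loops to be tightly nested, which is why in your reduction version above the clause can't wrap both loops (the sum initialization sits between them).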