I am very new to OpenACC and I don't quite understand data movement and the "#pragma acc data" directive.
I have a program written in C. Here is an extract from the code:
#pragma acc data create(intersectionSet[0:intersectionsCount][0:4]) // line 122
#pragma acc kernels // line 123
for (int i = 0; i<intersectionsCount; i++){ // line 124
    intersectionSet[i][0] = 9; // line 125
}
intersectionsCount has the value 210395. I compiled the above code with the following command and then ran it:
pgcc -o rect_openacc -fast -Minfo -acc -ta=nvidia,time rect.c
I have this output:
time(us): 1,475,607
122: data region reached 1 time
31: kernel launched 210395 times
grid: [1] block: [128]
device time(us): total=1,475,315 max=15 min=7 avg=7
elapsed time(us): total=5,451,647 max=24,028 min=24 avg=25
123: compute region reached 1 time
124: kernel launched 1 time
grid: [1644] block: [128]
device time(us): total=292 max=292 min=292 avg=292
elapsed time(us): total=312 max=312 min=312 avg=312
156: data region reached 1 time
I have some questions after reading the output:
Update
I do have a couple of header files for the program. But those files do not have "acc data" or "acc kernels" pragma.
After compiling the code with "-Minfo=all", the result is as follows:
breakStringToCharArray:
11, include "stringHelper.h"
50, Loop not vectorized/parallelized: contains call
countChar:
11, include "stringHelper.h"
74, Loop not vectorized/parallelized: not countable
extractCharToIntRequiredInt:
11, include "stringHelper.h"
93, Loop not vectorized/parallelized: contains call
extractArray:
12, include "fileHelper.h"
49, Loop not vectorized/parallelized: contains call
isRectOverlap:
13, include "shapeHelper.h"
23, Generating acc routine vector
Generating Tesla code
getRectIntersection:
13, include "shapeHelper.h"
45, Generating acc routine vector
Generating Tesla code
getRectIntersectionInGPU:
13, include "shapeHelper.h"
69, Generating acc routine vector
Generating Tesla code
max:
13, include "shapeHelper.h"
98, Generating acc routine vector
Generating Tesla code
min:
13, include "shapeHelper.h"
118, Generating acc routine vector
Generating Tesla code
main:
64, Loop not vectorized/parallelized: contains call
108, Loop not vectorized/parallelized: contains call
122, Generating create(intersectionSet[:intersectionsCount][:4])
124, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
124, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
I create intersectionSet in this way:
// array of row pointers: each element is an int *
intersectionSet = (int **)malloc(sizeof(int *) * intersectionsCount);
for (i = 0; i<intersectionsCount; i++){
    // each row holds 4 ints
    intersectionSet[i] = (int *)malloc(sizeof(int) * 4);
}
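What's happening is that intersectionSet is an array of pointers to separately allocated rows (an "int **"; at least I'm guessing that's what it is), so the compiler cannot create it on the device in one step. It must first allocate the array of row pointers on the device, then loop over every row to allocate its device array, and finally launch a small kernel per row to store that row's device pointer in the pointer array. That per-row work is what the profile reports under the data region at line 122: a kernel launched 210395 times with grid [1] and block [128]. Here's some pseudo-code to help illustrate: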
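devPtrPtr = deviceMalloc(numElements * pointerSize);      // device array of row pointers
for (i = 0; i < numElements; ++i) {
    devPtr = deviceMalloc(elementSize * dataTypeSize);    // one device allocation per row
    // one tiny kernel launch per row, only to store devPtr in devPtrPtr[i]
    deviceKernelToSetPointer<<<1,128>>>(&devPtrPtr[i], devPtr);
}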
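To help your code, I'd switch the dimensions so that the first dimension has length 4 and the second has length "intersectionsCount", i.e. intersectionSet[4][intersectionsCount]. The data region then needs only 4 device allocations and pointer-setting kernels instead of 210395. This will also help data access on the device, since the "vector" loop should run over the stride-1 (contiguous) dimension in order to avoid memory divergence.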
Hope this helps,
Mat