OpenACC data movement

I am very new to OpenACC and I don't quite understand about the data movement and the "#pragma acc data" clause.

I have a program written in C. An extract from the code is like that:

#pragma acc data create(intersectionSet[0:intersectionsCount][0:4]) // line 122
#pragma acc kernels // line 123
for (int i = 0; i<intersectionsCount; i++){ // line 124
    intersectionSet[i][0] = 9; // line 125
}

intersectionsCount has value 210395. After compiling and run the above code by the following:

pgcc -o rect_openacc -fast -Minfo -acc -ta=nvidia,time rect.c

I have this output:

    time(us): 1,475,607
122: data region reached 1 time
    31: kernel launched 210395 times
        grid: [1]  block: [128]
         device time(us): total=1,475,315 max=15 min=7 avg=7
        elapsed time(us): total=5,451,647 max=24,028 min=24 avg=25
123: compute region reached 1 time
    124: kernel launched 1 time
        grid: [1644]  block: [128]
         device time(us): total=292 max=292 min=292 avg=292
        elapsed time(us): total=312 max=312 min=312 avg=312
156: data region reached 1 time

I have some questions after reading the output:

I don't know why it said line 31, as line 31 has no acc pragma. Does it mean something that I can't trace?
In the line "31: kernel launched 210395 times", it said it launched 210395 times the kernel. I don't know if it is normal that the kernel needs to launch so many times, because this part has taken 5,451,647(us) and I think it is a little bit long. I think the for-loop is simple and shouldn't take so much time. Am I using the pragma in a wrong way?

Update
I do have a couple of header files for the program. But those files do not have "acc data" or "acc kernels" pragma.

After compiling the code with "-Minfo=all", the result is as follows:

breakStringToCharArray:
 11, include "stringHelper.h"
      50, Loop not vectorized/parallelized: contains call
countChar:
 11, include "stringHelper.h"
      74, Loop not vectorized/parallelized: not countable
extractCharToIntRequiredInt:
 11, include "stringHelper.h"
      93, Loop not vectorized/parallelized: contains call
extractArray:
 12, include "fileHelper.h"
      49, Loop not vectorized/parallelized: contains call
isRectOverlap:
 13, include "shapeHelper.h"
      23, Generating acc routine vector
          Generating Tesla code
getRectIntersection:
 13, include "shapeHelper.h"
      45, Generating acc routine vector
          Generating Tesla code
getRectIntersectionInGPU:
 13, include "shapeHelper.h"
      69, Generating acc routine vector
          Generating Tesla code
max:
 13, include "shapeHelper.h"
      98, Generating acc routine vector
          Generating Tesla code
min:
 13, include "shapeHelper.h"
     118, Generating acc routine vector
          Generating Tesla code
main:
64, Loop not vectorized/parallelized: contains call
108, Loop not vectorized/parallelized: contains call
122, Generating create(intersectionSet[:intersectionsCount][:4])
124, Loop is parallelizable
     Accelerator kernel generated
     Generating Tesla code
124, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

I create intersectionSet in this way:

intersectionSet = (int **)malloc(sizeof(int **) * intersectionsCount);
for (i = 0; i<intersectionsCount; i++){
    intersectionSet[i] = (int *)malloc(sizeof(int *) * 4);
}

Solution

What's happening is that since you have pointer to pointers array, "**", (at least I'm guessing that's what intersectionSet is) the compiler must first allocate the pointer to pointer on the device, then loop over each element to allocate the individual device arrays. Finally, it then needs to launch a kernel to set the pointer value on the device. Here's some Pseudo-code to help illustrate.

devPtrPtr = deviceMalloc(numElements*pointer size);
for (i=0; i < numElements; ++i) {
   devPtr = deviceMalloc(elementSize * dataTypeSize);
   call deviceKernelToSetPointer<<<1,128>>(devPtrPtr[i],devPtr);
}

To help your code, I'd switch the dimensions making the column length 4 and the row length "intersectionsCount". This will also help the data access on the device since the "vector" loop should correspond to the stride-1 (contiguous) dimension in order to avoid memory divergence.

Hope this helps,

Mat