
Erroneous result using OpenACC "collapse" when too many levels are collapsed


I ran into a strange bit of “magic” with OpenACC while working on a much larger analysis (which also uses MPI, hence the “mpi…” references below) that appears to arise from the use of “collapse” in a #pragma directive. I found a workable solution, but I don't understand either the cause of the problem or why my solution works. I have written the following MRE (apologies for the contrived nature of the code, but it arose from my original analysis) that produces a similarly confusing result. On line 42 of the MRE, using “collapse(2)” (shown below) gives the expected result, but using “collapse(3)” gives a completely wrong result, even though both perform the same number of passes through the loops. The error appears to be deterministic, since I get the same erroneous output each time I run it.

I am hoping that I can get answers to three questions:

 1. Why does “collapse(3)” give the wrong results, even though the number of sums is the same?
 2. Why does that problem go away when you use “collapse(2)”?
 3. Using “collapse(2)”, is there some OpenACC pragma that could be used to speed up the innermost loop (on variable “j”) that would still give the correct result?

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void* allocMatrix (int nRow, int nCol) {
        void* restrict m = malloc (sizeof(int[nRow][nCol]));
        return(m);
}

int main(int argc, char *argv[]) {
        int nrows = 100;
        int ncols = 15;
        int blockSize = 10;
        int blockOffset = blockSize/2;
        int blockStart;
        int maxSize = 5;

        int (*a)[ncols] = (int (*)[ncols])allocMatrix(nrows, ncols);
        int (*sum_XY)[maxSize] = (int (*)[maxSize])allocMatrix(blockSize, maxSize);

        int nbrBlocks = (nrows-blockSize)/(blockSize/2) + 1;
        int nbrSums = 0;

        int blockNbr, x, y, j, xx, yy;
        int sumAll = 0;

        memset( a, 1, sizeof(int)*nrows*ncols );

        for ( blockNbr=0; blockNbr<nbrBlocks; blockNbr++ ) {
                blockStart = (blockOffset) * blockNbr;

                for ( x=0; x<blockSize; x++ ) {
                        for ( y=0; y<maxSize; y++ ) {
                                sum_XY[x][y] = 0;
                        }
                }
         
#pragma acc parallel loop gang device_type(acc_device_nvidia) vector_length(256) collapse(2) private(x,y,j,xx,yy)
//#pragma acc parallel loop gang device_type(acc_device_nvidia) vector_length(256) collapse(3) private(x,y,j,xx,yy)
                for ( x=0; x<blockSize; x++ ) {
                        for ( y=0; y<maxSize; y++ ) {
                                for ( j=0; j<ncols; j++ ) {
                                        xx = x + blockStart;
                                        yy = xx - y;
                                        if ( yy >= 0 ) {
                                                sum_XY[x][y] += a[xx][j];
                                                nbrSums += 1;
                                        }
                                }
                        }
                }

                for ( x=0; x<blockSize; x++ ) {
                        for ( y=0; y<maxSize; y++ ) {
                                sumAll += sum_XY[x][y];
                        }
                }

        }

        printf( "sumAll : %d\n", sumAll );
        printf( "nbrSums: %d\n", nbrSums );
}

The compiler output and results:

On line 42, using “collapse(2)”:

$ mpic++ -mcmodel=medium -acc -ta=tesla:managed -Minfo=accel MRE_collapse.c -o MRE_collapse
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
     44, Generating NVIDIA GPU code
         44, #pragma acc loop gang, vector(256) collapse(2) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(+:nbrSums)
         45,   /* blockIdx.x threadIdx.x collapsed */
         46, #pragma acc loop seq
     44, Generating implicit copy(sum_XY[:10][:5],nbrSums) [if not already present]
         Generating implicit copyin(a[blockStart:10][:15]) [if not already present]
     46, Complex loop carried dependence of sum_XY->,a-> prevents parallelization
         Loop carried dependence of sum_XY-> prevents parallelization
         Loop carried backward dependence of sum_XY-> prevents vectorization
$ mpirun  -np 1 MRE_collapse
sumAll : 1263225620
nbrSums: 14100
$

On line 42, using “collapse(3)”:

$ mpic++ -mcmodel=medium -acc -ta=tesla:managed -Minfo=accel MRE_collapse.c -o MRE_collapse
nvc++-Warning-CUDA_HOME has been deprecated. Please, use NVHPC_CUDA_HOME instead.
main:
     44, Generating NVIDIA GPU code
         44, #pragma acc loop gang, vector(256) collapse(3) /* blockIdx.x threadIdx.x */
             Generating implicit reduction(+:nbrSums)
         45,   /* blockIdx.x threadIdx.x collapsed */
         46,   /* blockIdx.x threadIdx.x collapsed */
     44, Generating implicit copy(sum_XY[:10][:5],nbrSums) [if not already present]
         Generating implicit copyin(a[blockStart:10][:15]) [if not already present]
$ mpirun  -np 1 MRE_collapse
sumAll : -1347440724
nbrSums: 14100
$

Solution

  • The "j" loop has a race condition on "sum_XY[x][y] += a[xx][j];". To fix, you'll want to add an atomic update.

          if ( yy >= 0 ) {
             #pragma acc atomic update
             sum_XY[x][y] += a[xx][j];
             nbrSums += 1;
          }
    

    This will allow you to use the "collapse(3)" clause.
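
    For reference, here is a minimal sketch of the whole nest with "collapse(3)" plus the atomic in place (it simply mirrors the pragma from the question; "nbrSums" is still covered by the implicit reduction shown in the -Minfo output):

          #pragma acc parallel loop gang device_type(acc_device_nvidia) vector_length(256) collapse(3) private(x,y,j,xx,yy)
          for ( x=0; x<blockSize; x++ ) {
             for ( y=0; y<maxSize; y++ ) {
                for ( j=0; j<ncols; j++ ) {
                   xx = x + blockStart;
                   yy = xx - y;
                   if ( yy >= 0 ) {
                      /* serialize updates to the shared sum_XY element */
                      #pragma acc atomic update
                      sum_XY[x][y] += a[xx][j];
                      nbrSums += 1;   /* handled by the implicit reduction(+:nbrSums) */
                   }
                }
             }
          }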

    Atomics do incur some overhead, but the amount is largely determined by how often a collision occurs.

    A second method would be to use a "loop vector" reduction on the j loop, replacing "sum_XY" with a scalar:

          int sumxy = 0;
          #pragma acc loop vector reduction(+:sumxy)
          for ( j=0; j<ncols; j++ ) {
             xx = x + blockStart;
             yy = xx - y;
             if ( yy >= 0 ) {
                sumxy += a[xx][j];
                nbrSums += 1;
             }
          }
          sum_XY[x][y] = sumxy;
    

    The problem here is that ncols is only 15, so you'd be wasting threads. Even after removing "vector_length(256)" so that only 32 threads are used, 17 of them are still idle. However, if ncols is larger (say >64) in the full code, this may be the better option.
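
    Putting the pieces together, the full nest for this second approach might look roughly like the sketch below. It keeps the gang-level "collapse(2)" from the question, hoists "xx"/"yy" out of the j loop since they don't depend on "j", and states the "nbrSums" reduction explicitly instead of relying on the implicit one:

          #pragma acc parallel loop gang collapse(2) reduction(+:nbrSums)
          for ( x=0; x<blockSize; x++ ) {
             for ( y=0; y<maxSize; y++ ) {
                int xx = x + blockStart;   /* independent of j, so compute once per (x,y) */
                int yy = xx - y;
                int sumxy = 0;
                #pragma acc loop vector reduction(+:sumxy) reduction(+:nbrSums)
                for ( j=0; j<ncols; j++ ) {
                   if ( yy >= 0 ) {
                      sumxy += a[xx][j];
                      nbrSums += 1;
                   }
                }
                sum_XY[x][y] = sumxy;   /* single writer per (x,y), so no atomic needed */
             }
          }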