openacc, pgi, pgi-accelerator

Illegal context for vector clause in simple OpenACC kernel


I'm trying to compile a simple OpenACC benchmark:

void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
#pragma acc parallel copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop vector(128)
  {
    for (int i = 0; i < 256; ++i) {
      float sum = 0;
      for (int j = 0; j < 256; ++j) {
        sum += *(a + a_stride * i + j);
      }
      *(c + c_stride * i) = sum;
    }
  }
}

with NVIDIA HPC SDK 21.5 and run into an error:

$ nvc++ -S tmp.cc -Wall -Wextra -O2 -acc -acclibs -Minfo=all -g -gpu=cc80
NVC++-S-0155-Illegal context for gang(num:) or worker(num:) or vector(length:)  (tmp.cc: 7)
NVC++/x86-64 Linux 21.5-0: compilation completed with severe errors

Any idea what may cause this? From what I can tell my syntax for vector(128) is legal.


Solution

  • OpenACC does not allow "vector(value)" on a loop inside a "parallel" construct. Set the vector length with a "vector_length" clause on the parallel directive instead. A "parallel" construct defines a single compute region to be offloaded, so all vector loops in that region must share the same vector length.

    "vector(value)" is allowed only inside a "kernels" construct, because the compiler can split that region into multiple kernels, each with its own vector length.

    Option 1:

    % cat test.c
    void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
    #pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
    #pragma acc loop vector
      {
        for (int i = 0; i < 256; ++i) {
          float sum = 0;
          for (int j = 0; j < 256; ++j) {
            sum += *(a + a_stride * i + j);
          }
          *(c + c_stride * i) = sum;
        }
      }
    }
    % nvc -acc -c test.c -Minfo=accel
    foo:
          4, Generating copyout(c[:c_stride*256]) [if not already present]
             Generating copyin(a[:a_stride*256]) [if not already present]
             Generating Tesla code
              5, #pragma acc loop vector(128) /* threadIdx.x */
              7, #pragma acc loop seq
          5, Loop is parallelizable
          7, Loop is parallelizable
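
    Because "vector_length" on the parallel directive applies to every vector loop in the region, the same clause also works when the vector loop is the inner one. The following is only a sketch of that variation, not part of the original benchmark: the function name foo_gang_vector is hypothetical, and the choice to map rows to gangs and reduce each row across vector lanes is an assumption about what one might want here. Built without -acc, the pragmas are ignored and the loops run sequentially on the host.

```c
/* Hypothetical variant of foo: rows are distributed across gangs and the
   inner sum is reduced across vector lanes. The single vector_length(128)
   on the parallel directive governs the vector loop in the region. */
void foo_gang_vector(const float * restrict a, int a_stride,
                     float * restrict c, int c_stride) {
#pragma acc parallel vector_length(128) copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
#pragma acc loop gang
  for (int i = 0; i < 256; ++i) {
    float sum = 0;
    /* "vector" takes no argument inside a parallel construct. */
#pragma acc loop vector reduction(+:sum)
    for (int j = 0; j < 256; ++j) {
      sum += *(a + a_stride * i + j);
    }
    *(c + c_stride * i) = sum;
  }
}
```

    Compiling this with -Minfo=accel, as in the options shown here, should report which schedule the compiler actually chose for each loop.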
    

    Option 2:

    % cat test.c
    void foo(const float * restrict a, int a_stride, float * restrict c, int c_stride) {
    #pragma acc kernels copyin(a[0:a_stride*256]) copyout(c[0:c_stride*256])
    #pragma acc loop independent vector(128)
      {
        for (int i = 0; i < 256; ++i) {
          float sum = 0;
          for (int j = 0; j < 256; ++j) {
            sum += *(a + a_stride * i + j);
          }
          *(c + c_stride * i) = sum;
        }
      }
    }
    % nvc -acc -c test.c -Minfo=accel
    foo:
          4, Generating copyout(c[:c_stride*256]) [if not already present]
             Generating copyin(a[:a_stride*256]) [if not already present]
          5, Loop is parallelizable
             Generating Tesla code
              5, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
              7, #pragma acc loop seq
          7, Loop is parallelizable
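
    To illustrate why "vector(value)" is allowed under "kernels", here is a minimal sketch with two loops in one kernels region, each requesting a different vector length. The function scale_then_shift and its arguments are hypothetical and not part of the original question; whether the compiler honors the requested lengths is best checked in the -Minfo=accel output.

```c
/* Hypothetical example: two independent loops in one kernels region.
   The compiler may generate a separate kernel for each loop, so each
   loop can carry its own vector length. Built without -acc, the pragmas
   are ignored and the loops run sequentially on the host. */
void scale_then_shift(float * restrict x, float * restrict y, int n) {
#pragma acc kernels copy(x[0:n], y[0:n])
  {
#pragma acc loop independent vector(128)
    for (int i = 0; i < n; ++i) {
      x[i] = 2.0f * x[i];
    }
#pragma acc loop independent vector(256)
    for (int i = 0; i < n; ++i) {
      y[i] = y[i] + 1.0f;
    }
  }
}
```

    The same two loops inside a single "parallel" construct would have to share one "vector_length" value.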