openmp, openacc

Why does "#pragma omp loop for" produce a compile error?


Why doesn't this compile? The following MRE is built from a number of examples trying to get OpenMP and OpenACC to work together. The compile command produces the following error:

$ mpic++ -mcmodel=medium -fopenmp -acc -ta=tesla:managed -Minfo=accel mp_acc.c -o mp_acc
"mp_acc.c", line 49: error: invalid text in pragma
               #pragma omp loop for
                                ^

1 error detected in the compilation of "mp_acc.c".
$

I am running:

$ mpic++ --version

nvc++ 22.1-0 64-bit target on x86-64 Linux -tp skylake-avx512 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

The C++ array allocation is a bit new to me, but it does appear to work better than the array-of-pointers approach. The MPI compiler doesn't tell me what is "invalid" about what looks like a standard pragma.

Here is the MRE:

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <cstdlib>
#include <string>
#include <mpi.h>
#include <omp.h>
#include <openacc.h>
#include <bits/stdc++.h>
#include <sys/stat.h>
using namespace std;

void* allocMatrix (int nRow, int nCol) {
        void* restrict m = malloc (sizeof(int[nRow][nCol]));
        return(m);
}

#pragma acc routine gang 
void* func(void* a, int nrows, int ncols) {
        return(a);
}
int main(int argc, char *argv[]) {
        int nrows = 5;
        int ncols = 3;
        int (*a)[ncols] = (int (*)[ncols])allocMatrix(nrows, ncols);
        int* restrict ta = (int*)malloc(nrows * sizeof(int));

        for ( int i=0; i<nrows; i++ ) {
                for ( int j=0; j<ncols; j++ ) {
                        a[i][j] = 1;
                }
        }
        for ( int i=0; i<nrows; i++ ) {
                for ( int j=0; j<ncols; j++ ) {
                        cout << a[i][j] << " ";
                }
                cout << endl;
        }

        #pragma omp parallel num_threads() 
        {
               size_t tid = omp_get_thread_num();

               #pragma omp loop for   // the directive the compiler rejects
               for (int i = 0; i < nrows; ++i) {
                       #pragma acc parallel deviceptr(a,nrows,ncols) async(tid)
                       {
                          a = (int (*)[ncols]) func(a, nrows, ncols);
                       }
               }
               #pragma acc wait
        }
        for ( int i=0; i<nrows; i++ ) {
                for ( int j=0; j<ncols; j++ ) {
                        cout << a[i][j] << " ";
                }
                cout << endl;
        }


        memset( a, 0, nrows*ncols*sizeof(int) );
        for ( int i=0; i<nrows; i++ ) {
                for ( int j=0; j<ncols; j++ ) {
                        cout << a[i][j] << " ";
                }
                cout << endl;
        }

        free(a);
}

Solution

  • First of all, one issue in your code is that you are mixing OpenMP and OpenACC directives. These are two different application programming interfaces (APIs) targeting different features (though some can be very similar): omp is for OpenMP and acc is for OpenACC. You should not mix them unless the API explicitly mentions that it is supported (AFAICT, this is not the case).
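
    For example, the same loop is written either with an OpenMP directive or with an OpenACC directive, never by gluing the two together. A minimal sketch with a placeholder computation (the array x and its length n are made up for illustration):

        int n = 1000;
        int *x = (int*)malloc(n * sizeof(int));

        // OpenMP: run the loop on several CPU threads.
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            x[i] = 2 * i;

        // OpenACC: the equivalent loop, offloaded to the accelerator instead.
        #pragma acc parallel loop copy(x[0:n])
        for (int i = 0; i < n; ++i)
            x[i] = 2 * i;

        free(x);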

    You should not use OpenACC directives in a parallel OpenMP section. GPU kernels are not faster when they are launched from multiple host threads, because a GPU kernel is already parallel and one efficient kernel can saturate the GPU's compute units. Concurrency can actually decrease performance in this case (due to the cumulated overheads and generally no speed-up), not to mention that it makes the code significantly harder to debug and increases the probability of introducing bugs like race conditions (assuming OpenACC actually supports this pattern at all).
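
    Concretely, the per-element work can be expressed as a single OpenACC kernel over the whole matrix instead of one asynchronous kernel per OpenMP thread. A sketch reusing a, nrows and ncols from the MRE, where the increment is only a placeholder for whatever func is supposed to do:

        // One kernel covering the full iteration space; no OpenMP threads needed.
        // With -ta=tesla:managed the runtime migrates the malloc'd matrix, so no
        // explicit data clause is required here.
        #pragma acc parallel loop collapse(2)
        for (int i = 0; i < nrows; ++i) {
            for (int j = 0; j < ncols; ++j) {
                a[i][j] += 1;   // placeholder for the real per-element work
            }
        }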

    The main issue is that your directive is invalid. Indeed, omp loop is a directive similar to omp for, but the OpenMP specification does not allow combining them. Thus, omp loop for is not a valid directive, and the compiler complains about the word for, which cannot appear there.
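
    In other words, each of the following directives is valid on its own, but their concatenation is not (sketch, reusing nrows from the MRE):

        #pragma omp parallel
        {
            // Valid: the classic worksharing-loop directive.
            #pragma omp for
            for (int i = 0; i < nrows; ++i) { /* ... */ }

            // Also valid: the newer loop directive introduced in OpenMP 5.0.
            #pragma omp loop
            for (int i = 0; i < nrows; ++i) { /* ... */ }

            // Invalid: "#pragma omp loop for" is not a directive name,
            // which is exactly what the compiler reports.
        }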

    AFAIK, omp loop is mainly meant to be used in target code (e.g. offloaded to GPUs) rather than in basic multi-threaded CPU code. It is not entirely clear what you want to achieve with this code, but if you want to execute the computation on a target device, then I advise using only OpenMP (assuming it is well supported on your target platform; otherwise only OpenACC, which tends to be better supported on HPC platforms). I strongly encourage you to read tutorials first so as to understand the basics of OpenMP/OpenACC.
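
    For instance, an OpenMP-only GPU version of the loop nest could use the combined target form of the loop directive. This is only a sketch: it assumes nvc++ is invoked with its OpenMP offload flags (e.g. -mp=gpu -gpu=managed instead of -acc -ta=tesla:managed), and the increment again stands in for the real work:

        // Offload the whole nest to the device with OpenMP alone; managed
        // memory lets the runtime move the malloc'd matrix automatically.
        #pragma omp target teams loop collapse(2)
        for (int i = 0; i < nrows; ++i) {
            for (int j = 0; j < ncols; ++j) {
                a[i][j] += 1;   // placeholder for the real per-element work
            }
        }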

    By the way, note that the command mpic++ is generally a wrapper and not (really) a compiler itself. It typically runs a compile command with additional options/flags, using a compiler such as GCC/Clang/ICC/PGI under the hood. OpenMP support generally depends on that underlying compiler, not on MPI. You can often get more information about it with the options --version or --help (or --showme for Open MPI).