
OpenMP Offload Error during linking with gcc with nvptx-none: unresolved symbol _fputwc_r


I am trying to compile a simple test problem using OpenMP offloading for an Nvidia GPU. I am using gcc with the nvptx-none target. I have installed the gcc+nvptx package with spack (or compiled gcc-13 with nvptx-tools myself, the results are the same). During linking, I get the error:

unresolved symbol _fputwc_r
collect2: error: ld returned 1 exit status
mkoffload: fatal error: x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
lto-wrapper: fatal error: /path/to/spack/opt/spack/linux-centos8-x86_64_v3/gcc-13.0.0/gcc-12.2.0-6olbpwbs53cquwnpsvrmuxprmaofwjtk/libexec/gcc/x86_64-pc-linux-gnu/12.2.0//accel/nvptx-none/mkoffload returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed

Compiling with -fno-stack-protector, as recommended elsewhere, does not alleviate the problem. Compiling with -fno-lto does, but then the offloading doesn't work. Different optimization flags make no difference.

The ld being used appears to be the system installation. The spack installation provides another ld in spack/linux-centos8-x86_64_v3/gcc-13.0.0/gcc-12.2.0-6olbpwbs53cquwnpsvrmuxprmaofwjtk/nvptx-none, but spack doesn't normally add this to the PATH. I guess that is with good reason, because including it leads to:

as: unrecognized option '--64'
nvptx-as: missing .version directive at start of file '/tmp/cc9YfveM.s'

Is this a problem with the linker, or something else? The problem only occurs when a parallel for loop is actually included; just setting #pragma omp target does not trigger it. The device is recognized, and code inside this pragma runs on the device according to OpenMP, as long as there is no parallel region present, which would produce the above error.

Additional information: The system is Rocky Linux release 8.7 (Green Obsidian). The test program I am executing is based on the OpenMP test programs. Its full code is:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
void saxpy(float a, float* x, float* y, int sz) {
#pragma omp target teams distribute parallel for simd \
   num_teams(3) map(to:x[0:sz]) map(tofrom:y[0:sz])
   for (int i = 0; i < sz; i++) {
      if (omp_is_initial_device()) {
         printf("Running on host\n");    
      } else {
         int nthreads= omp_get_num_threads();
         int nteams= omp_get_num_teams(); 
         printf("Running on device with %d teams (fixed) in total and %d threads in each team\n",nteams,nthreads);
      }
      fprintf(stdout, "Thread %d %i\n", omp_get_thread_num(), i );
      y[i] = a * x[i] + y[i];
   }
}
int main(int argc, char** argv) {
   float a = 2.0;
   int sz = 16;
   float *x = calloc( sz, sizeof *x );
   float *y = calloc( sz, sizeof *y );
   //Set values
   int num_devices = omp_get_num_devices();
   printf("Number of available devices %d\n", num_devices);
   saxpy( a, x, y, sz );
   return 0;
}

I try to compile it with

gcc -O0 -fopenmp -foffload=nvptx-none -o mintest mintest.c

or with the flags mentioned above.


Solution

  • I guess the issue is that GCC cannot deal with the printf and fprintf calls inside the code region that runs on the GPU. GPUs are typically not good at any form of I/O, so you should avoid calling things like printf, read, write, etc. within an offloaded code region.

    If you want to detect if the code was running on the GPU device or the host, then you can use a pattern like this:

    void test_on_gpu(void) {
        int on_device = 0;
        #pragma omp target teams map(from:on_device)
        {
            #pragma omp parallel
            {
                #pragma omp master
                {
                    if (0 == omp_get_team_num()) {
                        on_device = !omp_is_initial_device();
                    }
                }
            }
        }
        printf("on GPU: %s\n", on_device ? "yes" : "no");
    }
    

    What the code does is:

    • transition to the GPU device (target)
    • open a parallel region and pick one thread (the primary thread, via master) in the first OpenMP team
    • determine if execution happened on the GPU
    • return the test result via map(from:on_device)
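
The same idea can be applied to the per-iteration output in the question's saxpy loop. What follows is only a minimal sketch, not a tested fix for the nvptx toolchain: instead of calling printf or fprintf on the device, each iteration stores its team and thread number in arrays that are mapped back, and the host prints them afterwards. The names saxpy_logged, print_log, team_of and thread_of are hypothetical and do not appear in the original code. The simd clause from the question is dropped here for simplicity, and the #else stubs are an assumption added so the sketch also builds without -fopenmp.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Assumption: host-only stand-ins so the sketch builds without -fopenmp. */
static int omp_get_team_num(void) { return 0; }
static int omp_get_thread_num(void) { return 0; }
#endif

/* saxpy with no I/O in the offloaded region (hypothetical helper).
 * team_of[i] and thread_of[i] record which team and thread ran iteration i. */
void saxpy_logged(float a, const float *x, float *y,
                  int *team_of, int *thread_of, int sz) {
    #pragma omp target teams distribute parallel for num_teams(3) \
        map(to: x[0:sz]) map(tofrom: y[0:sz]) \
        map(from: team_of[0:sz], thread_of[0:sz])
    for (int i = 0; i < sz; i++) {
        team_of[i] = omp_get_team_num();
        thread_of[i] = omp_get_thread_num();
        y[i] = a * x[i] + y[i];
    }
}

/* Runs on the host only, after the target region has finished. */
void print_log(const int *team_of, const int *thread_of, int sz) {
    for (int i = 0; i < sz; i++) {
        printf("Iteration %d: team %d, thread %d\n",
               i, team_of[i], thread_of[i]);
    }
}
```

In main, you would allocate team_of and thread_of with sz entries each, call saxpy_logged in place of saxpy, and then call print_log. All stdio calls then stay in host code, so the device link step should no longer need stdio symbols on the device side.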