Compiler errors for GCC (via CUDA) intrinsic functions, but I'm not using any

Problem

I wrote a bunch of small CUDA programs. Most of them compile fine in both Debug and Release builds. However, a few fail to compile in Release mode because various GCC intrinsics are being passed the wrong pointer types. But I'm not actually using intrinsics. This program partially reproduces my problem:

#include <iostream>
#include <cuda_runtime.h> // To pacify the syntax highlighter
#include <immintrin.h> // NOTE: I don't ever include this header in my real code

__global__ void kernel() {
  // Do nothing
}

using namespace std;

int main() {
  kernel<<<1, 1>>>();
  cout << "Hello, world!" << endl;

  return 0;
}

The problem, however, is that in my actual code I do not include <immintrin.h> or use GCC intrinsics of any kind. It's possible that some library code I use does, but I don't know for sure. If I remove <immintrin.h> from this example, the program compiles and runs fine.

The actual offenders are here and here, if you want to see them.

Relevant Facts

I am using the following software:
- Ubuntu 17.04
- nvcc version 8.0.44
- gcc version 5.4.1
- cmake version 3.8.20170418
The projects build and run perfectly fine in Debug mode, including the sample program above.
Most of my small CUDA programs compile fine in Release builds, but I can't identify any patterns among the ones that fail.
The Release build command (this is the final link step) is as follows:
- /usr/bin/g++-5 -std=c++11 -fopenmp -O3 -DNDEBUG -rdynamic CMakeFiles/DotProduct.dir/DotProduct_generated_main.cu.o CMakeFiles/DotProduct.dir/DotProduct_intermediate_link.o -o DotProduct -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -ldl -lrt ../../Common/libCommon.a -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -ldl -lrt -Wl,-Bstatic -lcudadevrt -Wl,-Bdynamic -L/usr/lib/x86_64-linux-gnu -lSDL2 -lSDL2_ttf -lSDL2 -lGLEW -lGLU -lGL
The full build log is too large to attach, but can be found here. Here's a relevant sample of the errors:

/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9533): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h(9542): error: argument of type "void *" is incompatible with parameter of type "long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512pfintrin.h(54): error: argument of type "const void *" is incompatible with parameter of type "const long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512pfintrin.h(62): error: argument of type "const void *" is incompatible with parameter of type "const int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512pfintrin.h(70): error: argument of type "const void *" is incompatible with parameter of type "const long long *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512pfintrin.h(78): error: argument of type "const void *" is incompatible with parameter of type "const int *"
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512pfintrin.h(86): error: argument of type "void *" is incompatible with parameter of type "const long long *"

~~The error log does not say who is actually using these intrinsics.~~
As it turns out, the include chain is as follows: <random> -> <opt_random.h> -> <x86intrin.h> -> <immintrin.h> -> (every other header mentioned in the error log). My goal now is to enable all the usual optimizations except those that use intrinsics.

Solution

It turns out that this is likely an nvcc bug stemming from CUDA's stated lack of support for my particular system configuration. I filed a report here (you need to be logged in to see it).

For now, I worked around it by not using anything that requires intrinsics. In my case, I used Thrust's random number generators instead of the standard library's. Someone I talked to suggested that I could also separate my host and device code more carefully, so that the source files processed by nvcc never include <immintrin.h>. I haven't tried it, but for anyone who finds this in the future, it's worth a shot.
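
For what it's worth, here is a rough sketch of what that host/device split might look like. This is untested against the nvcc setup above, and every name in it (host_random.cpp, fill_random) is made up for illustration. The idea is that <random> is included only in a plain .cpp file that g++ compiles directly, while the .cu files see nothing but a one-line declaration that includes no intrinsics headers.

```cpp
// host_random.cpp (hypothetical file name): compiled by g++ only, never by nvcc.
#include <cstddef>
#include <random>  // the header that drags in <immintrin.h>; keep it out of .cu files

// Assumption: in a real project this declaration would sit in a small header
// (for example host_random.h, also a made-up name) that the .cu files include.
// That header would need only <cstddef>, so nvcc never parses <random>.
void fill_random(float* out, std::size_t count, unsigned seed);

// Fills out[0..count) with pseudo-random floats drawn from the unit interval.
// The same seed always produces the same sequence on a given standard library.
void fill_random(float* out, std::size_t count, unsigned seed) {
  std::mt19937 engine(seed);
  std::uniform_real_distribution<float> dist(0.0f, 1.0f);
  for (std::size_t i = 0; i < count; ++i) {
    out[i] = dist(engine);
  }
}
```

The .cu side would then call fill_random(buffer, n, seed) on a host buffer and copy the result to the device with cudaMemcpy as usual. Passing a raw pointer rather than a std::vector keeps the shared declaration as small as possible, though that is a design choice, not a requirement.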