Edit: I've filed this as a bug at https://developer.nvidia.com/nvidia_bug/3711214.
I'm writing a numerical simulation program that is giving subtly-incorrect results in Release mode, but seemingly correct results in Debug mode. The original program used curand for random sampling, but I've reduced it to a much simpler and more deterministic MVCE which launches a single kernel of 1 block * 1 warp (of 32 threads), where each thread:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
__global__ void test_kernel()
{
int cSteps = 0;
int cIters = 0;
float pos = 0;
//curandState localState = state[threadIdx.x];
while (true) {
float rn = threadIdx.x * 0.01 + 0.001;
pos += rn;
cSteps++;
if (pos > 1.0f) {
pos = 0;
cIters++;
if (cSteps > 1024) {
break;
}
}
}
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
}
int main()
{
test_kernel <<<1, 32>>> ();
return 0;
}
In debug mode, the shuffle works as expected. I see each thread start out with its own value:
0: Th 0 cI 2
0: Th 1 cI 12
0: Th 2 cI 22
0: Th 3 cI 32
0: Th 4 cI 41
// ...
after the first shuffle xor 1, each pair of threads agrees on the same number:
1: Th 0 cI 14
1: Th 1 cI 14
1: Th 2 cI 54
1: Th 3 cI 54
after the shuffle xor 2, each group of four threads agrees:
2: Th 0 cI 68
2: Th 1 cI 68
2: Th 2 cI 68
2: Th 3 cI 68
2: Th 4 cI 223
2: Th 5 cI 223
2: Th 6 cI 223
2: Th 7 cI 223
and so on. After the last shuffle, all threads in the warp agree on the same value (4673).
As soon as I enable Release mode, I get results that are subtly garbage. The values entering the shuffle are the same, and the values after the first round of the shuffle agree with the debug build (and agree within each pair as before). As soon as I do a shuffle xor 2, the results fall apart:
2: Th 0 cI 28
2: Th 1 cI 28
2: Th 2 cI 108
2: Th 3 cI 108
2: Th 4 cI 186
2: Th 5 cI 186
2: Th 6 cI 260
2: Th 7 cI 260
In fact, this is the exact output that a debug build (and hand inspection) would produce if the shuffle sequence were replaced by this specific broken one:
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32); // 2 changed to 1
cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32); // 2 changed to 1
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
The full diff of the output is here.
Hardware and software environment is as follows:
GA103 3080Ti (mobile), at manufacturer-recommended clocks, 16 G VRAM. Machine doesn't seem to be having corruption with other Cuda programs (tested with primegrid-CUDA and tasks verified against double-checks)
CUDA 11.0
MVSC host compiler 14.29.30133
Full debug command line as follows:
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc142.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
Full release command line as follows:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc142.pdb /FS /Zi /MD " -o x64\Release\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
Things I tried without resolution:
Having each thread write to global memory and then summing on the host CPU does produce correct results.
Replacing all the shuffles with calls to __shfl_sync
and manually-calculated lane IDs works. Replacing just the broken shuffle xor 2 with a __shfl_sync
doesn't. Replacing just the first shuffle xor 1 (which worked correctly) with a __shfl_sync
does seem to fix it. (These two workarounds apply to my MVCE; I have not had a chance to evaluate whether they apply to the full program)
// unexpectedly working
int id = threadIdx.x;
printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
__syncthreads();
cSteps += __shfl_sync(0xffffffff, cSteps, id ^ 1, 32);
cIters += __shfl_sync(0xffffffff, cIters, id ^ 1, 32);
printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);
printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);
printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);
printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);
cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);
printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
Even though I have a workaround, I'm afraid that I'm still hitting undefined behavior somewhere and my fix might be brittle.
Can anyone shed light on this? Is there indeed UB in my program? Is this a known compiler bug?
This was confirmed to be a compiler bug according to the CUDA engineering team. The fix is coming soon, as confirmed by a communication from them:
The fix is targeting a future major CUDA release after CUDA 11. The JIT fix will possibly be a little earlier in a Driver branch after latest R515 online.
Edit: Doesn't appear to be fixed in the 516.94 Game Ready driver. It does seem fixed in 522.25 with Cuda 11.8.
They also confirm that turning off optimizations fixes the issue; they do not comment on any workarounds that work reliably with optimization still on.
The following workarounds worked for me on my hardware and compiler, but YMMV:
__shfl_sync
instead of shfl_add_sync
or shfl_xor_sync
__reduce_add_sync