Tags: cuda, compiler-optimization, gpu-warp

Why is my CUDA warp shuffle sum using the wrong offset for one shuffle step?


Edit: I've filed this as a bug at https://developer.nvidia.com/nvidia_bug/3711214.

I'm writing a numerical simulation program that gives subtly incorrect results in Release mode, but seemingly correct results in Debug mode. The original program used curand for random sampling, but I've reduced it to a much simpler, fully deterministic MVCE that launches a single kernel of 1 block * 1 warp (of 32 threads), where each thread:

  • Performs a computation with a loop that will likely become warp-divergent, especially near the end as some threads complete their task before others.
  • Syncs the threads back together.
  • Attempts to butterfly-shuffle data with fellow threads in the warp to obtain a single sum.
  • [not needed in the MVCE] thread 0 would write the sum back to global memory so it can be copied to the host
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>


__global__ void test_kernel()
{

    int cSteps = 0;
    int cIters = 0;
    float pos = 0;

    //curandState localState = state[threadIdx.x];

    while (true) {
        float rn = threadIdx.x * 0.01 + 0.001;
        pos += rn;
        cSteps++;
        if (pos > 1.0f) {
            pos = 0;
            cIters++;
            if (cSteps > 1024) {
                break;
            }
        }
    }

    printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
    __syncthreads();
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);

    printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);

    printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);

    printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);

    printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);

    printf("16: Th %2d cI %d\n", threadIdx.x, cIters);
}

int main()
{
    test_kernel <<<1, 32>>> ();
    cudaDeviceSynchronize(); // wait for the kernel so device-side printf output is flushed before exit
    return 0;
}

In debug mode, the shuffle works as expected. I see each thread start out with its own value:

 0: Th 0 cI 2
 0: Th 1 cI 12
 0: Th 2 cI 22
 0: Th 3 cI 32
 0: Th 4 cI 41
// ...

after the first shuffle xor 1, each pair of threads agrees on the same number:

 1: Th  0 cI 14
 1: Th  1 cI 14
 1: Th  2 cI 54
 1: Th  3 cI 54

after the shuffle xor 2, each group of four threads agrees:

 2: Th  0 cI 68
 2: Th  1 cI 68
 2: Th  2 cI 68
 2: Th  3 cI 68
 2: Th  4 cI 223
 2: Th  5 cI 223
 2: Th  6 cI 223
 2: Th  7 cI 223

and so on. After the last shuffle, all threads in the warp agree on the same value (4673).

As soon as I enable Release mode, I get results that are subtly garbage. The values entering the shuffle are the same, and the values after the first round of the shuffle agree with the debug build (and agree within each pair as before). As soon as I do a shuffle xor 2, the results fall apart:

 2: Th  0 cI 28
 2: Th  1 cI 28
 2: Th  2 cI 108
 2: Th  3 cI 108
 2: Th  4 cI 186
 2: Th  5 cI 186
 2: Th  6 cI 260
 2: Th  7 cI 260

In fact, this is the exact output that a debug build (and hand inspection) would produce if the shuffle sequence were replaced by this specific broken one:

    printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
    __syncthreads();
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32);

    printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 1, 32); // 2 changed to 1
    cIters += __shfl_xor_sync(0xffffffff, cIters, 1, 32); // 2 changed to 1

    printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);

    printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);

    printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);

The full diff of the output is here.

Hardware and software environment is as follows:

  • GA103 3080 Ti (mobile), at manufacturer-recommended clocks, 16 GB VRAM. The machine doesn't appear to corrupt results in other CUDA programs (tested with primegrid-CUDA, with tasks verified against double-checks)

  • CUDA 11.0

  • MSVC host compiler 14.29.30133

  • Full debug command line as follows:

    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu   -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include"  -G   --keep-dir x64\Debug -maxrregcount=0  --machine 64 --compile -cudart static  -g   -DWIN32 -DWIN64 -D_DEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc142.pdb /FS /Zi /RTC1 /MDd " -o x64\Debug\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
    
  • Full release command line as follows:

    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\nvcc.exe" -gencode=arch=compute_52,code=\"sm_52,compute_52\" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.29.30133\bin\HostX86\x64" -x cu   -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\include"     --keep-dir x64\Release -maxrregcount=0  --machine 64 --compile -cudart static     -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /Fdx64\Release\vc142.pdb /FS /Zi  /MD " -o x64\Release\kernel.cu.obj "C:\Users\[username]\source\repos\BugRepro\BugRepro\kernel.cu"
    

Things I tried without resolution:

  • Adding/removing __syncthreads() calls (where one is shown, and between shuffle calls), even though they shouldn't be necessary since each shuffle synchronizes
  • Changing the compute capability to 8.0 to better match my card
  • Forcing base clocks on the GPU
  • Shuffling in the opposite order (16/8/4/2/1)
  • Using __shfl_down_sync instead of xor, with the same pattern of offsets.

Having each thread write to global memory and then summing on the host CPU does produce correct results.

Replacing all the shuffles with calls to __shfl_sync and manually-calculated lane IDs works. Replacing just the broken shuffle xor 2 with a __shfl_sync doesn't. Replacing just the first shuffle xor 1 (which worked correctly) with a __shfl_sync does seem to fix it. (These two workarounds apply to my MVCE; I have not had a chance to evaluate whether they apply to the full program)

    // unexpectedly working
    int id = threadIdx.x;
    printf(" 0: Th %d cI %d\n", threadIdx.x, cIters);
    __syncthreads();
    cSteps += __shfl_sync(0xffffffff, cSteps, id ^ 1, 32);
    cIters += __shfl_sync(0xffffffff, cIters, id ^ 1, 32);

    printf(" 1: Th %2d cI %d\n", threadIdx.x, cIters);
    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 2, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 2, 32);

    printf(" 2: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 4, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 4, 32);

    printf(" 4: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 8, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 8, 32);

    printf(" 8: Th %2d cI %d\n", threadIdx.x, cIters);

    cSteps += __shfl_xor_sync(0xffffffff, cSteps, 16, 32);
    cIters += __shfl_xor_sync(0xffffffff, cIters, 16, 32);

    printf("16: Th %2d cI %d\n", threadIdx.x, cIters);

Even though I have a workaround, I'm afraid that I'm still hitting undefined behavior somewhere and my fix might be brittle.

Can anyone shed light on this? Is there indeed UB in my program? Is this a known compiler bug?


Solution

  • The CUDA engineering team confirmed this to be a compiler bug. A fix is on the way, per a communication from them:

    The fix is targeting a future major CUDA release after CUDA 11. The JIT fix will possibly be a little earlier in a Driver branch after latest R515 online.

    Edit: This doesn't appear to be fixed in the 516.94 Game Ready driver. It does appear fixed in 522.25 with CUDA 11.8.

    They also confirm that turning off optimizations fixes the issue; they do not comment on any workarounds that work reliably with optimization still on.

    The following workarounds worked for me on my hardware and compiler, but YMMV:

    • using __shfl_sync with manually computed lane IDs instead of __shfl_down_sync or __shfl_xor_sync
    • using __reduce_add_sync (available on compute capability 8.0 and newer)
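For completeness, here is how I read the __reduce_add_sync suggestion; a minimal sketch assuming compute capability 8.0+ (compile with -arch=sm_80 or newer, and the same includes as the MVCE above), not code supplied by the CUDA team:

```cuda
// Warp-level reduction via the sm_80+ hardware intrinsic; this replaces
// the whole butterfly sequence, sidestepping the miscompiled shuffles.
__global__ void test_kernel_reduce()
{
    int cSteps = 0, cIters = 0;
    float pos = 0;

    while (true) {                       // same divergent loop as the MVCE
        float rn = threadIdx.x * 0.01 + 0.001;
        pos += rn;
        cSteps++;
        if (pos > 1.0f) {
            pos = 0;
            cIters++;
            if (cSteps > 1024) break;
        }
    }

    __syncthreads();
    cSteps = __reduce_add_sync(0xffffffff, cSteps);
    cIters = __reduce_add_sync(0xffffffff, cIters);
    printf("Th %2d cI %d\n", threadIdx.x, cIters);
}
```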