
Implementing a SpinLock in a HLSL DirectCompute shader


I'm trying to implement a spin lock in a compute shader, but my implementation doesn't seem to lock anything.

Here is how I implement the spin lock:

void LockAcquire()
{
    uint Value = 1;

    // Spin until InterlockedCompareExchange sees Lock == 0 and swaps in 1.
    // The original value of Lock is returned in Value, so Value == 0 means
    // the lock was free and we have just acquired it.
    [allow_uav_condition]
    while (Value) {
        InterlockedCompareExchange(DataOutBuffer[0].Lock, 0, 1, Value);
    }
}

void LockRelease()
{
    // Atomically write 0 back to the lock; the previous value is discarded.
    uint Value;
    InterlockedExchange(DataOutBuffer[0].Lock, 0, Value);
}

Background: I need a spin lock because I have to compute a sum of data over a large two-dimensional array. The sum is a double. Computing the sum with a single thread and a double loop produces the correct result. Computing the sum with multiple threads produces a wrong result, even when introducing the spin lock to avoid conflicts while accumulating the sum.

I can't use InterlockedAdd because the sum doesn't fit into a 32-bit integer, and I'm using Shader Model 5 (compiler 47).
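
As an aside, a sum wider than 32 bits can in principle be emulated with two 32-bit counters and a manual carry; here is a minimal sketch of that idea, where SumLo and SumHi are hypothetical uint fields added to TDataOutBuffer (they are not in my actual struct):

// Sketch only: emulate a 64-bit unsigned accumulator with two 32-bit
// fields (SumLo/SumHi are assumed additions to TDataOutBuffer).
void AtomicAdd64(uint Value)
{
    uint OldLo;
    InterlockedAdd(DataOutBuffer[0].SumLo, Value, OldLo);
    // A carry out of the low word occurred if the 32-bit addition wrapped.
    if (OldLo > 0xFFFFFFFFu - Value) {
        InterlockedAdd(DataOutBuffer[0].SumHi, 1);
    }
}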

Here is the single thread version, producing the correct result:

[numthreads(1, 1, 1)]
void CSGrayAutoComputeSumSqr(
    uint3 Gid  : SV_GroupID,
    uint3 DTid : SV_DispatchThreadID, // Coordinates in RawImage window
    uint3 GTid : SV_GroupThreadID,
    uint  GI   : SV_GroupIndex)
{
    if ((DTid.x == 0) && (DTid.y == 0)) {
        uint2 XY;
        int   Mean = (int)round(DataOutBuffer[0].GrayAutoResultMean);
        for (XY.x = 0; XY.x < (uint)RawImageSize.x; XY.x++) {
            for (XY.y = 0; XY.y < (uint)RawImageSize.y; XY.y++) {
                int  Value  = GetPixel16BitGrayFromRawImage(RawImage, RawImageSize, XY);
                uint UValue = (Mean - Value) * (Mean - Value);
                DataOutBuffer[0].GrayAutoResultSumSqr += UValue;
            }
        }
    }
}

and below is the multithreaded version. This version produces a similar but different result on each execution, which IMO is caused by the non-functioning lock.

[numthreads(1, 1, 1)]
void CSGrayAutoComputeSumSqr(
    uint3 Gid  : SV_GroupID,
    uint3 DTid : SV_DispatchThreadID, // Coordinates in RawImage window
    uint3 GTid : SV_GroupThreadID,
    uint  GI   : SV_GroupIndex)
{
    int  Value  = GetPixel16BitGrayFromRawImage(RawImage, RawImageSize, DTid.xy);
    int  Mean   = (int)round(DataOutBuffer[0].GrayAutoResultMean);
    uint UValue = (Mean - Value) * (Mean - Value);
    LockAcquire();
    DataOutBuffer[0].GrayAutoResultSumSqr += UValue;
    LockRelease();
}

Data used:

cbuffer TImageParams : register(b0)
{
    int2   RawImageSize;       // Actual image size in RawImage
}

struct TDataOutBuffer
{
    uint   Lock;                             // Use for SpinLock
    double GrayAutoResultMean;
    double GrayAutoResultSumSqr;
};

ByteAddressBuffer                  RawImage       : register(t0);
RWStructuredBuffer<TDataOutBuffer> DataOutBuffer  : register(u4);

Dispatch code:

FImmediateContext->CSSetShader(FComputeShaderGrayAutoComputeSumSqr, NULL, 0);
FImmediateContext->Dispatch(FImageParams.RawImageSize.X, FImageParams.RawImageSize.Y, 1);

The function GetPixel16BitGrayFromRawImage accesses the RawImage byte address buffer to fetch 16-bit pixel values from a grayscale image. It produces the expected result.
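
For context, a simplified sketch of what such a fetch can look like (an assumption, not necessarily the actual implementation); note that ByteAddressBuffer.Load only reads 4-byte-aligned dwords:

int GetPixel16BitGrayFromRawImage(ByteAddressBuffer Image, int2 Size, uint2 XY)
{
    uint ByteOffset = (XY.y * (uint)Size.x + XY.x) * 2;  // 2 bytes per pixel
    uint Word = Image.Load(ByteOffset & ~3u);            // Load is dword-aligned
    // Pick the low or high 16 bits depending on which half the pixel sits in.
    return (int)((ByteOffset & 2) != 0 ? (Word >> 16) : (Word & 0xFFFF));
}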

Any help appreciated.


Solution

  • You are a victim of the XY Problem here.

    Let's start with the Y problem. Your spin lock does not lock. To understand why your spin lock does not work, you need to examine how a GPU handles the situation you are creating. You dispatch thread groups, each of which the GPU executes as one or more warps (wavefronts, if you prefer), and each warp is made of many threads. Execution of a warp is fast as long as it stays parallel, which means that all the threads making up a warp must execute the same instruction at the same time. Every time you insert a condition (like the while loop in your algorithm), some of your threads must take one route and some the other. This is called divergence of the threads. The problem is that a warp cannot execute different instructions in parallel.

    In this case the GPU can take one of two routes:

    1. Dynamic branching, which means that the wavefront (the warp) takes one of the two routes and deactivates the threads that should take the other. Then it rolls back to pick up the sleeping threads where they were left.
    2. Flat branching, which means that all threads execute both branches, then each thread discards the unwanted result and keeps the correct one (illustrated just below this list).
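
    A hypothetical sketch, purely to illustrate a divergent branch (Output is an assumed buffer, not from the question):

    RWStructuredBuffer<float> Output : register(u0);

    [numthreads(64, 1, 1)]
    void CSDivergenceDemo(uint3 DTid : SV_DispatchThreadID)
    {
        float Result;
        if ((DTid.x & 1) != 0) {
            Result = sin((float)DTid.x);  // odd lanes take this route
        } else {
            Result = cos((float)DTid.x);  // even lanes take this one
        }
        // Dynamic branching: the warp runs each side with the other lanes masked off.
        // Flat branching: every lane computes both sides and keeps one result.
        Output[DTid.x] = Result;
    }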

    Now the funny part:

    There is no hard-and-fast rule stating how a GPU is supposed to handle branching.

    You have no way to predict whether the GPU will use one approach or the other, and in the case of dynamic branching there is no way to know in advance whether the GPU will put to sleep the straight route or the other one, the branch with fewer threads or the one with more. There is no way to know in advance, and different GPUs may (and will) execute the same code differently. The same GPU may even change its execution with different driver versions.

    In the case of your spinlock, your GPU (with its driver and the compiler version you are currently using) most probably goes for the flat branching strategy. This means that both branches are executed by all threads of a warp, so there basically is no lock at all.

    If you restructure the code (or hint the compiler, for example with the [branch] attribute on a conditional), you can push it toward the dynamic branching flow. But this would not solve your issue. In the particular case of a spinlock, what you are asking the GPU to do is to shut off all threads but one, and this is exactly what the GPU does not want to do. The GPU will try to do the opposite and shut off the only thread that evaluates the condition differently. This will indeed lead to less divergence and increase performance… but in your case it will shut off the only thread that is not in an endless loop. So you might get a full wavefront of threads locked in an endless loop, because the only one that might unlock the loop… is sleeping. Your spinlock has effectively become a deadlock.

    Now, on your particular machine the program might even run fine, but you have exactly zero guarantees that it will run on other machines, or even with different driver versions. You update the driver and boom, your program suddenly hits a GPU timeout and crashes.

    The best suggestion about spinlocks on GPUs is: don't use them. Ever.

    Now let's get back to your X problem.

    What you really need is a way to compute the sum of data in a large two-dimensional array, so what you are really looking for is a good parallel reduction algorithm. There are some on the internet, or you can code your own, depending on your needs.
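
    For instance, here is a minimal sketch of a one-level groupshared reduction. The 256-thread group size and the Values/PartialSums buffers are assumptions for the sketch, and the input is assumed padded with zeros to a multiple of the group size:

    #define GROUP_SIZE 256

    StructuredBuffer<uint>   Values      : register(t0);  // e.g. squared differences
    RWStructuredBuffer<uint> PartialSums : register(u0);  // one entry per group

    groupshared uint gShared[GROUP_SIZE];

    [numthreads(GROUP_SIZE, 1, 1)]
    void CSReduceSum(uint3 DTid : SV_DispatchThreadID,
                     uint3 Gid  : SV_GroupID,
                     uint  GI   : SV_GroupIndex)
    {
        // Stage one value per thread in groupshared memory.
        gShared[GI] = Values[DTid.x];
        GroupMemoryBarrierWithGroupSync();

        // Tree reduction: halve the number of active threads each step.
        for (uint Stride = GROUP_SIZE / 2; Stride > 0; Stride >>= 1) {
            if (GI < Stride) {
                gShared[GI] += gShared[GI + Stride];
            }
            GroupMemoryBarrierWithGroupSync();
        }

        // A single write per group; no lock is needed anywhere.
        if (GI == 0) {
            PartialSums[Gid.x] = gShared[0];
        }
    }

    You would dispatch ceil(N / 256) groups, then either run a second reduction pass over PartialSums or add the few remaining partials on the CPU.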

    I will just add a few links to get you started, if you need them.

    A Digression on Divergence

    NVIDIA - GPU Technology Conference 2010 Slides

    Goddeke - Introductory tutorial

    Donovan - GPU Parallel Scan

    Barlas - Multicore and GPU programming