Search code examples
cudalockingatomicdeadlockspinlock

Cuda Mutex, why deadlock?


I am trying to implement a atomic based mutex.
I succeed it but I have one question about warps / deadlock.

This code works well.

bool blocked = true;

while(blocked) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        
        doCriticJob();

        atomicExch(&mLock, 0);
        blocked = false;
    }
}

But this one doesn't...

while(true) {
    if(0 == atomicCAS(&mLock, 0, 1)) {
        index = mSize++;
        
        doCriticJob();

        atomicExch(&mLock, 0);
        break;
    }
}

I think it's a position of exiting loop. In the first one, exit happens where the condition is, in the second one it happens in the end of if, so the thread wait for other warps finish loop, but other threads wait the first thread as well... But I think I am wrong, so if you can explain me :).


Solution

    1. There are other questions here on mutexes. You might want to look at some of them. Search on "cuda critical section", for example.

    2. Assuming that one will work and one won't because it seemed to work for your test case is dangerous. Managing mutexes or critical sections, especially when the negotiation is amongst threads in the same warp is notoriously difficult and fragile(*). The general advice is to avoid it. As discussed elsewhere, if you must use mutexes or critical sections, have a single thread in the threadblock negotiate for any thread that needs it, then control behavior within the threadblock using intra-threadblock synchronization mechanisms, such as __syncthreads().

    3. This question (IMO) can't really be answered without looking at the way the compiler is ordering the various paths of execution. Therefore we need to look at the SASS code (the machine code). You can use the cuda binary utilities to do this, and will probably want to refer to both the PTX reference as well as the SASS reference. This also means that you need a complete code, not just the snippets you've provided.

    Here's my code for analysis:

    $ cat t830.cu
    #include <stdio.h>
    
    
    __device__ int mLock = 0;
    
    __device__ void doCriticJob(){
    
    }
    
    __global__ void kernel1(){
    
      int index = 0;
      int mSize = 1;
      while(true) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
    
            doCriticJob();
    
            atomicExch(&mLock, 0);
            break;
        }
      }
    }
    
    __global__ void kernel2(){
    
      int index = 0;
      int mSize = 1;
      bool blocked = true;
    
      while(blocked) {
        if(0 == atomicCAS(&mLock, 0, 1)) {
            index = mSize++;
    
            doCriticJob();
    
            atomicExch(&mLock, 0);
            blocked = false;
        }
      }
    }
    int main(){
    
     kernel2<<<4,128>>>();
     cudaDeviceSynchronize();
    }
    

    kernel1 is my representation of your deadlock code, and kernel2 is my representation of your "working" code. When I compile this on linux under CUDA 7 and run on a cc2.0 device (Quadro5000), if I call kernel1 the code will deadlock, and if I call kernel2 (as is shown) it doesn't.

    I use cuobjdump -sass to dump the machine code:

    $ cuobjdump -sass ./t830
    
    Fatbin elf code:
    ================
    arch = sm_20
    code version = [1,7]
    producer = <unknown>
    host = linux
    compile_size = 64bit
    
            code for sm_20
    
    Fatbin elf code:
    ================
    arch = sm_20
    code version = [1,7]
    producer = cuda
    host = linux
    compile_size = 64bit
    
            code for sm_20
                    Function : _Z7kernel1v
            .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
            /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
            /*0008*/         MOV32I R4, 0x1;                   /* 0x1800000004011de2 */
            /*0010*/         SSY 0x48;                         /* 0x60000000c0000007 */
            /*0018*/         MOV R2, c[0xe][0x0];              /* 0x2800780000009de4 */
            /*0020*/         MOV R3, c[0xe][0x4];              /* 0x280078001000dde4 */
            /*0028*/         ATOM.E.CAS R0, [R2], RZ, R4;      /* 0x54080000002fdd25 */
            /*0030*/         ISETP.NE.AND P0, PT, R0, RZ, PT;  /* 0x1a8e0000fc01dc23 */
            /*0038*/     @P0 BRA 0x18;                         /* 0x4003ffff600001e7 */
            /*0040*/         NOP.S;                            /* 0x4000000000001df4 */
            /*0048*/         ATOM.E.EXCH RZ, [R2], RZ;         /* 0x547ff800002fdd05 */
            /*0050*/         EXIT;                             /* 0x8000000000001de7 */
                    ............................
    
    
                    Function : _Z7kernel2v
            .headerflags    @"EF_CUDA_SM20 EF_CUDA_PTX_SM(EF_CUDA_SM20)"
            /*0000*/         MOV R1, c[0x1][0x100];            /* 0x2800440400005de4 */
            /*0008*/         MOV32I R0, 0x1;                   /* 0x1800000004001de2 */
            /*0010*/         MOV32I R3, 0x1;                   /* 0x180000000400dde2 */
            /*0018*/         MOV R4, c[0xe][0x0];              /* 0x2800780000011de4 */
            /*0020*/         MOV R5, c[0xe][0x4];              /* 0x2800780010015de4 */
            /*0028*/         ATOM.E.CAS R2, [R4], RZ, R3;      /* 0x54061000004fdd25 */
            /*0030*/         ISETP.NE.AND P1, PT, R2, RZ, PT;  /* 0x1a8e0000fc23dc23 */
            /*0038*/    @!P1 MOV R0, RZ;                       /* 0x28000000fc0025e4 */
            /*0040*/    @!P1 ATOM.E.EXCH RZ, [R4], RZ;         /* 0x547ff800004fe505 */
            /*0048*/         LOP.AND R2, R0, 0xff;             /* 0x6800c003fc009c03 */
            /*0050*/         I2I.S32.S16 R2, R2;               /* 0x1c00000008a09e84 */
            /*0058*/         ISETP.NE.AND P0, PT, R2, RZ, PT;  /* 0x1a8e0000fc21dc23 */
            /*0060*/     @P0 BRA 0x18;                         /* 0x4003fffec00001e7 */
            /*0068*/         EXIT;                             /* 0x8000000000001de7 */
                    ............................
    
    
    
    Fatbin ptx code:
    ================
    arch = sm_20
    code version = [4,2]
    producer = cuda
    host = linux
    compile_size = 64bit
    compressed
    $
    

    Considering a single warp, with either code, all threads must acquire the lock (via atomicCAS) once, in order for the code to complete successfully. With either code, only one thread in a warp can acquire the lock at any given time, and in order for other threads in the warp to (later) acquire the lock, that thread must have an opportunity to release it (via atomicExch).

    The key difference between these realizations then, lies in how the compiler scheduled the atomicExch instruction with respect to conditional branches.

    Let's consider the "deadlock" code (kernel1). In this case, the ATOM.E.EXCH instruction does not occur until after the one (and only) conditional branch (@P0 BRA 0x18;) instruction. A conditional branch in CUDA code represents a possible point of warp divergence, and execution after warp divergence is, to some degree, unspecified and up to the specifics of the machine. But given this uncertainty, it's possible that the thread that acquired the lock will wait for the other threads to complete their branches, before executing the atomicExch instruction, which means that the other threads will not have a chance to acquire the lock, and we have deadlock.

    If we then compare that to the "working" code, we see that once the ATOM.E.CAS instruction is issued, there are no conditional branches in between that point and the point at which the ATOM.E.EXCH instruction is issued, thus releasing the lock just acquired. Since each thread that acquires the lock (via ATOM.E.CAS) will release it (via ATOM.E.EXCH) before any conditional branching occurs, there isn't any possibility (given this code realization) for the kind of deadlock witnessed previously (with kernel1) to occur.

    (@P0 is a form of predication, and you can read about it in the PTX reference here to understand how it can lead to conditional branching.)

    NOTE: I consider both of these codes to be dangerous, and possibly flawed. Even though the current tests don't seem to uncover a problem with the "working" code, I think it's possible that a future CUDA compiler might choose to schedule things differently, and break that code. It's even possible that compiling for a different machine architecture might produce different code here. I consider a mechanism like this to be more robust, which avoids intra-warp contention entirely. Even such a mechanism, however, can lead to inter-threadblock deadlocks. Any mutex must be used under specific programming and usage limitations.

    (*) Note that this situation is mitigated in Volta (PDF) by independent thread scheduling.