CUDA atomic argument race conditions

If there is device code structured as follows

Item* prev_entry = array[entry->prev];
prev_entry->next = entry->next;

And it were rewritten as an atomic operation

atomicExch(&(array[entry->prev]->next), entry->next);

is the memory access of array done atomically along with the access of next? There may be other threads that modify entry->prev (as they may be another Item's next value) and if the array access is done non-atomically then entry->prev may change between accessing the array and the execution of the atomic operation on the address next resulting in an incorrect result.

To frame the question more generally, are all operations within an atomic operation's arguments executed atomically?

Solution

is the memory access of array done atomically along with the access of next?

No it is not. If you study the corresponding SASS code, you will discover that the read operation associated with entry->next here:

atomicExch(&(array[entry->prev]->next), entry->next);

is an ordinary read operation, not protected in any way. That read operation puts the atomic "update value" in a register. Another register holds the address to update. The atomic operation works on those registers (returning its result in another register, if relevant).

Here is an example:

$ cat t1983.cu
__global__ void k(int *al, int *d){

  atomicExch(al, d[threadIdx.x]);
}


$ nvcc -c t1983.cu
$ cuobjdump -sass ./t1983.o

Fatbin elf code:
================
arch = sm_52
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_52
                Function : _Z1kPiS_
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                           /* 0x001c7c00fe0007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20] ;                  /* 0x4c98078000870001 */
        /*0010*/         {         MOV R2, c[0x0][0x140] ;                 /* 0x4c98078005070002 */
        /*0018*/                   S2R R4, SR_TID.X         }
                                                                           /* 0xf0c8000002170004 */
                                                                           /* 0x001fc800fec20ff1 */
        /*0028*/                   SHR.U32 R0, R4.reuse, 0x1e ;            /* 0x3828000001e70400 */
        /*0030*/                   ISCADD R4.CC, R4, c[0x0][0x148], 0x2 ;  /* 0x4c18810005270404 */
        /*0038*/                   IADD.X R5, R0, c[0x0][0x14c] ;          /* 0x4c10080005370005 */
                                                                           /* 0x041fc400fe8007b1 */
        /*0048*/                   LDG.E R4, [R4] ;                        /* 0xeed4200000070404 */
        /*0050*/                   MOV R3, c[0x0][0x144] ;                 /* 0x4c98078005170003 */
        /*0058*/                   ATOM.E.EXCH RZ, [R2], R4 ;              /* 0xed810000004702ff */
                                                                           /* 0x001ffc00ffe007ed */
        /*0068*/                   NOP ;                                   /* 0x50b0000000070f00 */
        /*0070*/                   EXIT ;                                  /* 0xe30000000007000f */
        /*0078*/                   BRA 0x78 ;                              /* 0xe2400fffff87000f */
                ..........

First of all, we note that the atomic operation works purely based on registers:

   ATOM.E.EXCH RZ, [R2], R4 ;

The register RZ is the "destination", it is RZ (the always-zero register, acting as a "discard" register) because we are not asking for the return value of the function. The register pair R2,R3 comprises the 64 bit address of the location to atomically "update" (in this case, replace its value), and the replacement value is contained in R4. Working backward we see that R4 was loaded here:

     LDG.E R4, [R4] ;

which is completely separate from the atomic. That is an "ordinary" load. You can push backward to find out that the register pair R4,R5 which contains the load address is populated with the kernel argument from constant memory, offset using the SR_TID.X register (corresponding to threadIdx.x) which makes sense.

Likewise the R2,R3 register pair, containing the address of the atomic update location, is loaded directly from kernel arguments, with no offset, which also makes sense.