
Numba cuda: Using shared memory to add numbers results in overwriting


I have been trying to add numbers using shared memory, so that the kernel works as follows:

  • Thread 0: Add 1 to shared memory variable sharedMemT[0]

  • Thread 1: Add 1 to shared memory variable sharedMemT[0]

  • Synchronize threads and store sharedMemT[0] into output[0]

But the result was... 1??

import numpy as np
from numba import cuda, int32

@cuda.jit()
def add(output):
    sharedMemT = cuda.shared.array(shape=(1), dtype=int32)
    sharedMemT[0] = 0
    cuda.syncthreads()
    
    sharedMemT[0] += 1
    cuda.syncthreads()
    output[0] = sharedMemT[0]

out = np.array([0])
add[1, 2](out)
print(out) # results in [1]

Solution

  • Congratulations, you have a memory race. Threads 0 and 1 run at the same time, so the results are undefined, both in the `+=` operation on the shared memory variable (a non-atomic read-modify-write, so both threads can read 0 before either writes back 1) and in the write back to global memory.

    For this to work correctly, you would need to serialize access to the shared memory variable using an atomic memory operation, and then only have one thread write back to global memory:

    $ cat atomic.py
    
    import numpy as np
    from numba import cuda, int32
    
    @cuda.jit()
    def add(output):
        sharedMemT = cuda.shared.array(shape=(1), dtype=int32)
        pos = cuda.grid(1)
        if pos == 0:
            sharedMemT[0] = 0
    
        cuda.syncthreads()
    
        cuda.atomic.add(sharedMemT, 0, 1)
        cuda.syncthreads()
    
        if pos == 0:
            output[0] = sharedMemT[0]
    
    out = np.array([0])
    add[1, 2](out)
    print(out)
    
    $ python atomic.py
    [2]