python parallel-processing cuda synchronization pycuda

How to use syncthreads in CUDA for a scan algorithm (Hillis-Steele)

I'm trying to implement a scan algorithm (Hillis-Steele) and I'm having some trouble understanding how to do it properly on CUDA. This is a minimal example using pyCUDA:

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
from pycuda.compiler import SourceModule

#compile cuda code
mod = SourceModule('''
__global__ void scan(int * addresses){
    for(int idx=1; idx <= threadIdx.x; idx <<= 1){
        int new_value = addresses[threadIdx.x] + addresses[threadIdx.x - idx];
        __syncthreads();
        addresses[threadIdx.x] = new_value;
    }    
}
''')
func = mod.get_function("scan")

#Initialize an array with 1's
addresses_h = np.full((896,), 1, dtype='i4')
addresses_d = cuda.to_device(addresses_h)

#Launch kernel and copy back the result
threads_x = 896
func(addresses_d, block=(threads_x, 1, 1), grid=(1, 1))
addresses_h = cuda.from_device(addresses_d, addresses_h.shape, addresses_h.dtype)

# Check the result is correct
for i, n in enumerate(addresses_h):
    assert i+1 == n

My question is about __syncthreads(). As you can see, I'm calling __syncthreads() inside the for loop and not every thread will execute that code the same number of times:

ThreadID - Times it will execute for loop
       0 :  0 times
       1 :  1 times
   2-  3 :  2 times
   4-  7 :  3 times
   8- 15 :  4 times
  16- 31 :  5 times
  32- 63 :  6 times
  64-127 :  7 times
 128-255 :  8 times
 256-511 :  9 times
 512-896 : 10 times

There can be threads in the same warp with different number of calls to syncthreads. What will happen in that case? How can threads that are not executing the same code be synchronized?

In the sample code, we start with an array full of 1's and we get in the output the index+1 as value for each position. It is computing the correct answer. Is it "by chance" or the code is correct?

If this is not a proper use of syncthreads, how could I implement such algoritm using cuda?

Solution

If this is not a proper use of syncthreads, how could I implement such algoritm using cuda?

One typical approach is to separate the conditional code from the __syncthreads() calls. Use the conditional code to determine what threads will participate.

Here's a simple transformation of your posted code, that should give the same result, without any violations (i.e. all threads will participate in every __syncthreads() operation):

mod = SourceModule('''
__global__ void scan(int * addresses){
    for(int i=1; i < blockDim.x; i <<= 1){
        int new_value;
        if (threadIdx.x >= i) new_value = addresses[threadIdx.x] + addresses[threadIdx.x - i];
        __syncthreads();
        if (threadIdx.x >= i) addresses[threadIdx.x] = new_value;
    }    
}
''')

I'm not suggesting this is a complete and proper scan, or that it is optimal, or anything of the sort. I'm simply showing how your code can be transformed to avoid the violation inherent in what you have.

If you want to learn more about scan methods, this is one source. But if you actually need a scan operation, I would suggest using thrust or cub.