Odd-even sort: Incorrect results when using multiple blocks in CUDA

I am new to PyCUDA and trying to implement the Odd-even sort using PyCUDA.

I managed to run it successfully on arrays whose size is limited by 2048 (using one thread block), but as soon as I tried to use multiple thread blocks, the result was no longer correct. I suspected this might be a synchronization problem but had no idea how to fix it.

bricksort_src = """
__global__
void bricksort(int *in, int *out, int n){
  int tid = threadIdx.x + (blockIdx.x * blockDim.x);

  if((tid * 2) < n) out[tid * 2] = in[tid *2];
  if((tid * 2 + 1) < n) out[tid * 2 + 1] = in[tid * 2 + 1];
  __syncthreads();


  // odd and even are used for adjusting the index
  // to avoid out-of-index exception
  int odd, even, alter;
  odd = ((n + 2) % 2) != 0;
  even = ((n + 2) % 2) == 0;
  // alter is used for alternating between the odd and even phases
  alter = 0;

  for(int i = 0; i < n; i++){  

    int idx = tid * 2 + alter;
    int adjust = alter == 0 ? odd : even;

    if(idx < (n - adjust)){
      int f, s;
      f = out[idx]; 
      s = out[idx + 1];
      if (f > s){
        out[idx] = s; 
        out[idx + 1] = f;
      }
    }
    __syncthreads();
    alter = 1 - alter;
  }
}
"""
bricksort_ker = SourceModule(source=bricksort_src)
bricksort = bricksort_ker.get_function("bricksort")


np.random.seed(0)
arr = np.random.randint(0,10,2**11).astype('int32')

iar = gpuarray.to_gpu(arr)
oar = gpuarray.empty_like(iar)
n = iar.size
num_threads = np.ceil(n/2)

if (num_threads < 1024):
  blocksize = int(num_threads)
  gridsize = 1
else: 
  blocksize = 1024
  gridsize = int(np.ceil(num_threads / blocksize))

bricksort(iar, oar, np.int32(n),
          block=(blocksize,1,1),
          grid=(gridsize,1,1))

Solution

Assembling comments into an answer:

odd-even sort can't be easily/readily extended beyond a single threadblock (because it requires synchronization) CUDA __syncthreads() only synchronizes at the block level. Without synchronization, CUDA specifies no particular order to thread execution.
for serious sorting work, I recommend a library implementation such as cub. If you want to do this from python I recommend cupy.
CUDA has a sample code that demonstrates odd-even sorting at the block level, but because of the sync issue it chooses a merge method to combine results
it should be possible to write an odd-even sort kernel that only does a single swap, then call this kernel in a loop. The kernel call itself acts as a device-wide synchronization point.
alternatively, it should be possible to do the work in a single kernel launch using cooperative groups grid sync.
none of these methods are likely to be faster than a good library implementation (which won't depend on odd-even sorting to begin with).