Tags: python, cuda, numba, gpu-shared-memory

Calculating distances between points using shared memory


I'm trying to calculate the (metric-weighted) distance between all pairs of points. To get a speedup, I am doing this on a GPU with CUDA through Numba, since I think it's more readable and easier to use.

I have two 1D arrays of 1D points and want to calculate the distance between all points within the same array as well as between all pairs of points across the two arrays. I've written two CUDA kernels. The first uses only global memory, and I have verified that it gives the correct answer against CPU code. Here it is:

@cuda.jit
def gpuSameSample(A,arrSum):
    tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
    temp = A[tx]
    tempSum = 0.0
    for i in range(tx+1,A.size):
        distance = (temp - A[i])**2
        tempSum +=  math.exp(-distance/sigma**2)
    arrSum[tx] = tempSum
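
For reference, a CPU version of the same sum, against which the kernel can be checked, might look like the following (a minimal sketch; sigma is assumed to be a module-level constant here, and cpuSameSample is only an illustrative name):

import math
import numpy as np

sigma = 1.0  # assumed module-level constant, matching the global read inside the kernel

def cpuSameSample(A):
    # For each point i, sum exp(-(A[i]-A[j])**2 / sigma**2) over all j > i,
    # mirroring the loop bounds in gpuSameSample.
    arrSum = np.zeros(A.size, dtype=np.float64)
    for i in range(A.size):
        tempSum = 0.0
        for j in range(i + 1, A.size):
            distance = (A[i] - A[j]) ** 2
            tempSum += math.exp(-distance / sigma ** 2)
        arrSum[i] = tempSum
    return arrSum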

I am now trying to optimize this further by using shared memory. This is what I have so far.

@cuda.jit
def gpuSharedSameSample(A,arrSum):
    #my block size is equal to 32
    sA = cuda.shared.array(shape=(tpb),dtype=float32)
    bpg = cuda.gridDim.x
    tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
    count = len(A)
    #loop through block by block
    tempSum = 0.0
    #myPoint = A[tx]

    if(tx < count):
        myPoint = A[tx]
        for currentBlock in range(bpg):

            #load in a block to shared memory
            copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
            if(copyIdx < count):
                sA[cuda.threadIdx.x] = A[copyIdx]
            #syncthreads to ensure copying finishes first
            cuda.syncthreads()


            if((tx < count)):
                for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                    if(copyIdx != tx):
                        distance = (myPoint - sA[i])**2
                        tempSum += math.exp(-distance/sigma**2)

            #syncthreads here to avoid race conditions if a thread finishes earlier
            #arrSum[tx] += tempSum
            cuda.syncthreads()
    arrSum[tx] += tempSum

I believe I have been careful about syncing threads, but this implementation gives an answer that is always too large (by about 5%). I'm guessing there must be some race condition, but as I understand it, each thread writes to a unique index, and the tempSum variable is local to each thread, so there shouldn't be any race condition. I'm quite sure that my for-loop conditions are correct. Any suggestions would be greatly appreciated. Thanks.


Solution

  • It's better if you provide complete code. It should be straightforward to do this with trivial additions to what you have shown, just as I have done below. However, there are differences between your two realizations, even with a restrictive set of assumptions.

    I will assume that:

    1. Your overall data set size is a whole number multiple of the size of your threadblock.
    2. You are launching exactly as many threads in total as the size of your data set.

    I'm also not going to try to comment on whether your shared realization makes sense, i.e. whether it should be expected to perform better than the non-shared realization, since that doesn't seem to be the crux of your question. The crux is why you are getting a numerical difference between the two realizations.

    The primary issue is that the method for selecting which elements contribute to the pairwise "distance" sum does not match between the two realizations. In the non-shared realization, for every element i in your input data set, you are computing a sum of distances between i and every element greater than i:

    for i in range(tx+1,A.size):
                   ^^^^^^^^^^^
    

    This selection of items to sum does not match the shared realization:

                for i in range(cuda.threadIdx.x,cuda.blockDim.x):
                    if(copyIdx != tx):
    

    There are several issues here, but it should be plainly evident that, for each block copied in, a given element at position threadIdx.x only updates its sum when the target element's position within the block (of data) is greater than or equal to that index. That means that as you go through the total data set block-wise, you will be skipping elements in each block. That could not possibly match the non-shared realization. If this is not evident, just pick actual values for the range of the for loop. Suppose cuda.threadIdx.x is 5 and cuda.blockDim.x is 32. Then that particular element will only compute a sum over items 5 through 31 in each block of data, throughout the array (and, because copyIdx does not depend on i, the copyIdx != tx test skips the thread's own block entirely rather than just its own element).
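
    To make that concrete, here is a small host-side sketch (plain Python with illustrative values, not kernel code) that enumerates the global indices each selection visits for the thread with tx = 5 in a 128-element array:

    blockDim = 32
    count = 128                       # 4 blocks of data
    tx = 5                            # thread 5 of block 0, so its global index is also 5

    # non-shared kernel: every element after tx
    non_shared = set(range(tx + 1, count))

    # shared kernel as written: local positions threadIdx.x..blockDim-1 of each tile,
    # and nothing at all from the thread's own tile (where copyIdx == tx)
    shared = set()
    for currentBlock in range(count // blockDim):
        copyIdx = tx % blockDim + currentBlock * blockDim
        for i in range(tx % blockDim, blockDim):
            if copyIdx != tx:
                shared.add(currentBlock * blockDim + i)

    print(len(non_shared), len(shared))   # 122 81 -> the two selections clearly differ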

    The solution to this problem is to force the shared realization to line up with the non-shared one in terms of how it selects the elements that contribute to the running sum.
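
    A way to see why the corrected condition in the complete code below achieves this: copyIdx - cuda.threadIdx.x equals currentBlock * cuda.blockDim.x, the base offset of the tile currently sitting in shared memory, so copyIdx - cuda.threadIdx.x + i is the global index of sA[i]. Requiring that global index to be greater than tx selects exactly the same elements as range(tx+1, A.size) does in the non-shared kernel. As a sketch (globalIdx is just a name introduced here for clarity; the complete code inlines the expression), the inner loop becomes:

    for i in range(cuda.blockDim.x):                  # walk the whole shared tile
        globalIdx = copyIdx - cuda.threadIdx.x + i    # = currentBlock*blockDim.x + i
        if(globalIdx > tx):                           # same selection as range(tx+1, A.size)
            distance = (myPoint - sA[i])**2
            tempSum += math.exp(-distance/sigma**2)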

    In addition, in the non-shared realization you are updating the output point only once, and you are doing a direct assignment:

    arrSum[tx] = tempSum
    

    In the shared realization, you are still only updating the output point once, but you are using an in-place addition rather than a direct assignment:

    arrSum[tx] += tempSum
    

    I changed this to a direct assignment (arrSum[tx] = tempSum) to match the non-shared realization.

    Here is complete code with those issues addressed:

    $ cat t49.py
    from numba import cuda
    import numpy as np
    import math
    import time
    from numba import float32
    
    sigma = np.float32(1.0)
    tpb = 32
    
    @cuda.jit
    def gpuSharedSameSample(A,arrSum):
        #my block size is equal to 32                                                                                                                               
        sA = cuda.shared.array(shape=(tpb),dtype=float32)
        bpg = cuda.gridDim.x
        tx = cuda.threadIdx.x + cuda.blockIdx.x *cuda.blockDim.x
        count = len(A)
        #loop through block by block                                                                                                                                
        tempSum = 0.0
        #myPoint = A[tx]                                                                                                                                            
    
        if(tx < count):
            myPoint = A[tx]
            for currentBlock in range(bpg):
    
        #load in a block to shared memory                                                                                                                           
                copyIdx = (cuda.threadIdx.x + currentBlock*cuda.blockDim.x)
                if(copyIdx < count): #this should always be true
                    sA[cuda.threadIdx.x] = A[copyIdx]
            #syncthreads to ensure copying finishes first                                                                                                           
                cuda.syncthreads()
    
    
                if((tx < count)):    #this should always be true
                    for i in range(cuda.blockDim.x):
                        if(copyIdx-cuda.threadIdx.x+i > tx):
                            distance = (myPoint - sA[i])**2
                            tempSum += math.exp(-distance/sigma**2)
    
     #syncthreads here to avoid race conditions if a thread finishes earlier                                                                                        
                #arrSum[tx] += tempSum                                                                                                                              
                cuda.syncthreads()
        arrSum[tx] = tempSum
    
    @cuda.jit
    def gpuSameSample(A,arrSum):
        tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
        temp = A[tx]
        tempSum = 0.0
        for i in range(tx+1,A.size):
            distance = (temp - A[i])**2
            tempSum +=  math.exp(-distance/sigma**2)
        arrSum[tx] = tempSum
    
    size = 128
    threads_per_block = tpb
    blocks = (size + (threads_per_block - 1)) // threads_per_block
    my_in  = np.ones( size, dtype=np.float32)
    my_out = np.zeros(size, dtype=np.float32)
    gpuSameSample[blocks, threads_per_block](my_in, my_out)
    print(my_out)
    gpuSharedSameSample[blocks, threads_per_block](my_in, my_out)
    print(my_out)
    $ python t49.py
    [ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
      115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
      103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
       91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
       79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
       67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
       55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
       43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
       31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
       19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
        7.    6.    5.    4.    3.    2.    1.    0.]
    [ 127.  126.  125.  124.  123.  122.  121.  120.  119.  118.  117.  116.
      115.  114.  113.  112.  111.  110.  109.  108.  107.  106.  105.  104.
      103.  102.  101.  100.   99.   98.   97.   96.   95.   94.   93.   92.
       91.   90.   89.   88.   87.   86.   85.   84.   83.   82.   81.   80.
       79.   78.   77.   76.   75.   74.   73.   72.   71.   70.   69.   68.
       67.   66.   65.   64.   63.   62.   61.   60.   59.   58.   57.   56.
       55.   54.   53.   52.   51.   50.   49.   48.   47.   46.   45.   44.
       43.   42.   41.   40.   39.   38.   37.   36.   35.   34.   33.   32.
       31.   30.   29.   28.   27.   26.   25.   24.   23.   22.   21.   20.
       19.   18.   17.   16.   15.   14.   13.   12.   11.   10.    9.    8.
        7.    6.    5.    4.    3.    2.    1.    0.]
    $
    
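
    As a quick sanity check on that output: with an all-ones input every squared distance is zero, each term is exp(0) = 1, and element i therefore just counts the elements after it, i.e. size - 1 - i, which is the descending ramp printed by both kernels. If you would rather not compare printouts by eye, a line like the following (an illustrative addition, not part of the script above) could be placed after each kernel call:

    assert np.allclose(my_out, np.arange(size - 1, -1, -1, dtype=np.float32))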

    Note that if either of my two assumptions is violated, your code has other issues.
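
    For example, if the data size is not a whole-number multiple of the block size (so the grid contains extra threads), the unguarded A[tx] read and arrSum[tx] write in gpuSameSample, and the final arrSum[tx] write in gpuSharedSameSample, would all need bounds checks. A minimal sketch of the non-shared kernel with such a guard (gpuSameSampleGuarded is only an illustrative name, not code from the question or the script above):

    @cuda.jit
    def gpuSameSampleGuarded(A, arrSum):
        tx = cuda.blockDim.x*cuda.blockIdx.x + cuda.threadIdx.x
        if tx < A.size:                   # extra threads in the last block do nothing
            temp = A[tx]
            tempSum = 0.0
            for i in range(tx+1, A.size):
                distance = (temp - A[i])**2
                tempSum += math.exp(-distance/sigma**2)
            arrSum[tx] = tempSum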

    In the future, I encourage you to provide short, complete code, as I have shown above. For a question like this, it should not be much additional work. If for no other reason (and there are other reasons), it's tedious to force others to write this code from scratch, when you already have it, just to demonstrate the sensibility of the answer provided.