
Pycuda Array Indexing with Threads & Blocks


I'm trying to write a CUDA histogram function for use with PyCUDA. The code seems to be iterating over more elements than the array I'm passing in contains. To rule out errors in the bin computation, I created a very simple kernel where I pass a 2d array and add 1 to the first bucket of the histogram for each element processed. I am consistently getting more elements than are in my 2d array.

The output should be [size_of_2d_array, 0,0,0].

I'm running on Ubuntu 15.04, python 2.7.9. When I try examples written by other people, they seem to work properly.

What am I doing wrong?

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

#Make the kernel
histogram_kernel = """
__global__ void kernel_getHist(unsigned int* array,unsigned int size, unsigned int lda, unsigned int* histo, float buckets)
{

    unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
    unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int tid = y + lda * x;


    if(tid<size){
        //unsigned int value = array[tid];

        //int bin = floor(value * buckets);

        atomicAdd(&histo[0],1);
    }
}
"""
mod = SourceModule(histogram_kernel)


#2d array to analyze
a = np.ndarray(shape=(2,2))
a[0,0] = 1
a[0,1] = 2
a[1,0] = 3
a[1,1] = 4


#histogram stuff, passing but not using right now
max_val = 4
num_bins = np.uint32(4)
bin_size = 1 / np.uint32(max_val / num_bins)

#send array to the gpu
a = a.astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)

#send the histogram to the gpu
a_hist = np.ndarray([1,num_bins])
a_hist = a_hist.astype(np.uint32)
a_hist = a_hist * 0
d_hist = cuda.mem_alloc(a_hist.nbytes)
cuda.memcpy_htod(d_hist, a_hist)

#get the function
func = mod.get_function('kernel_getHist')

#get size & lda
a_size = np.uint32(a.size)
a_lda = np.uint32(a.shape[0])

#print size & lda to check
print(a_lda)
print(a_size)

#run function
func(a_gpu, a_size, a_lda,  d_hist, bin_size, block=(16,16,1))

#get histogram back
cuda.memcpy_dtoh(a_hist, d_hist)

#print the histogram
print a_hist
print a

This code outputs the following:

2
4
[[6 0 0 0]]
[[ 1.  2.]
 [ 3.  4.]]

But, it should output:

2
4
[[4 0 0 0]]
[[ 1.  2.]
 [ 3.  4.]]

The histogram has too many elements, which leads me to believe that I'm doing something wrong with tid and size.

Any thoughts?


Solution

  • The problem here is that you are not calculating unique values of tid within the kernel. If you do some simple arithmetic on a piece of paper, you should get this for blockDim.x = blockDim.y = 16 and lda = 2:

    x   y   tid 
    0   0   0
    1   0   2
    0   1   1
    1   1   3
    0   2   2
    0   3   3
    ..  ..  ..
    

    Note the last two are duplicate indices. This is why your code returns 6: there are six threads which satisfy tid < size for size = 4.
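    You can reproduce the duplicate mapping on the host. A minimal Python sketch (no GPU needed) that simulates the kernel's index arithmetic for a single 16x16 block with lda = 2 and size = 4, as in the question:

    ```python
    # Simulate tid = y + lda * x for every thread in one 16x16 block.
    # Two different (x, y) pairs can produce the same tid, so the
    # tid < size check lets duplicate threads through.
    block_dim = 16
    lda = 2    # a.shape[0] in the question
    size = 4   # a.size in the question

    tids = [y + lda * x for x in range(block_dim) for y in range(block_dim)]
    passing = sorted(t for t in tids if t < size)

    print(len(passing))  # 6 threads pass the check, not 4
    print(passing)       # [0, 1, 2, 2, 3, 3] -- tids 2 and 3 occur twice
    ```

    Six threads reach the atomicAdd, which is exactly the 6 you see in the first histogram bucket.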

    You have two choices to fix this. One option is to compute a unique index correctly, something like:

    unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
    unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int gda = blockDim.y * gridDim.y;
    unsigned int tid =  y + gda * x;
    

    should work. Alternatively, you can apply bounds on each dimension of the input array separately (where M and N are the dimensions of the array):

    unsigned int y = threadIdx.y + blockDim.y * blockIdx.y;
    unsigned int x = threadIdx.x + blockDim.x * blockIdx.x;
    unsigned int tid =  y + lda * x;
    
    if ( (x < M) && (y < N) ) {
        atomicAdd(&histo[0], 1);
    }
    

    would also probably work.
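    Both fixes can be checked with the same host-side simulation. A sketch assuming a single 16x16 block (so gda = blockDim.y * gridDim.y = 16) and the question's 2x2 array (M = N = 2):

    ```python
    # Host-side check of both fixes for one 16x16 block and a 2x2 array.
    block_dim = 16
    gda = block_dim * 1   # blockDim.y * gridDim.y, with gridDim.y = 1
    size = 4              # total number of elements
    M = N = 2             # dimensions of the input array

    # Fix 1: unique linear index using the full grid pitch in y.
    fix1 = sum(1 for x in range(block_dim) for y in range(block_dim)
               if y + gda * x < size)

    # Fix 2: bound each dimension of the array separately.
    fix2 = sum(1 for x in range(block_dim) for y in range(block_dim)
               if x < M and y < N)

    print(fix1, fix2)  # both count exactly 4 threads
    ```

    Either way, exactly size threads perform the atomicAdd, so the first bucket comes back as 4.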