Tags: cuda, pycuda

Calculating indices for nested loops in CUDA


I'm trying to learn CUDA and I'm a bit confused about calculating thread indices. Let's say I have this nested loop I'm trying to parallelize:

...
for(int x = 0; x < DIM_x; x++){
    for(int y = 0; y < DIM_y; y++){
        for(int dx = 0; dx < psize; dx++){
            array[y*DIM_x + x + dx] += 1;
        }
    }
}

In PyCUDA, I set:

block = (8, 8, 8)
grid = (96, 96, 16)

Most of the examples I've seen for parallelizing loops calculate thread indices like this:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int dx = blockIdx.z * blockDim.z + threadIdx.z;

if (x >= DIM_x || y >= DIM_y || dx >= psize)
    return;

atomicAdd(&array[y*DIM_x + x + dx], 1);

DIM_x = 580, DIM_y = 550, psize = 50

However, if I print x, I see that multiple threads with the same thread ID (the same x value) are created, and the final result is wrong.

Instead, if I use this (3D grid of 3D blocks):

int blockId = blockIdx.x + blockIdx.y * gridDim.x
              + gridDim.x * gridDim.y * blockIdx.z;

int x = blockId * (blockDim.x * blockDim.y * blockDim.z)
        + (threadIdx.z * (blockDim.x * blockDim.y))
        + (threadIdx.y * blockDim.x) + threadIdx.x;

This fixes the duplicate thread ID problem for x, but I'm not sure how I'd parallelize y and dx.

If anyone could help me understand where I'm going wrong, and show me the right way to parallelize the loops, I'd really appreciate it.


Solution

  • However, if I print x, I see that multiple threads with the same thread ID (the same x value) are created, and the final result is wrong.

    It would be normal for you to see multiple threads with the same x thread ID in a multi-dimensional grid, just as it is normal to observe many iterations of the loops in your host code with the same x value; those threads simply differ in y and dx (the small diagnostic sketch after the worked example below makes this concrete). If the result is wrong, it has nothing to do with any of the code you have shown, viz:

    #include <vector>
    #include <thrust/device_vector.h>
    #include <thrust/copy.h>
    #include <assert.h>
    
    void host(int* array, int DIM_x, int DIM_y, int psize)
    {
        for(int x = 0; x < DIM_x; x++){
            for(int y = 0; y < DIM_y; y++){
                for(int dx = 0; dx < psize; dx++){
                    array[y*DIM_x + x + dx] += 1;
                }
            }
        }
    }
    
    
    __global__
    void kernel(int* array, int DIM_x, int DIM_y, int psize)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        int dx = blockIdx.z * blockDim.z + threadIdx.z;
    
        if (x >= DIM_x || y >= DIM_y || dx >= psize)
            return;
    
        atomicAdd(&array[y*DIM_x + x + dx], 1);
    }
    
    int main()
    {
        dim3 block(8, 8, 8);
        dim3 grid(96, 96, 16);
    
        int DIM_x = 580, DIM_y = 550, psize = 50;
    
        std::vector<int> array_h(DIM_x * DIM_y * psize, 0);
        std::vector<int> array_hd(DIM_x * DIM_y * psize, 0);
        thrust::device_vector<int> array_d(DIM_x * DIM_y * psize, 0);
    
        kernel<<<grid, block>>>(thrust::raw_pointer_cast(array_d.data()), DIM_x, DIM_y, psize);
        host(&array_h[0], DIM_x, DIM_y, psize);
    
        thrust::copy(array_d.begin(), array_d.end(), array_hd.begin());
        cudaDeviceSynchronize();
    
        for(int i=0; i<DIM_x * DIM_y * psize; i++) {
            assert( array_h[i] == array_hd[i] ); 
        }
    
        return 0;
    }
    

    which, when compiled and run,

    $ nvcc -arch=sm_52 -std=c++11 -o looploop loop_the_loop.cu 
    $ cuda-memcheck ./looploop 
    ========= CUDA-MEMCHECK
    ========= ERROR SUMMARY: 0 errors
    

    emits no errors and passes the check of all elements against the host code in your question.
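
    To make the first point concrete, here is a hedged diagnostic sketch (the kernel name and the filter value 3 are mine, chosen purely for illustration): every thread whose x works out to 3 prints its indices. Launched with the same grid and block as above, it prints many lines, each with a different (y, dx) pair, which is exactly the duplication you observed and is entirely harmless.

    #include <cstdio>

    __global__
    void show_duplicate_x(int DIM_x, int DIM_y, int psize)
    {
        int x  = blockIdx.x * blockDim.x + threadIdx.x;
        int y  = blockIdx.y * blockDim.y + threadIdx.y;
        int dx = blockIdx.z * blockDim.z + threadIdx.z;

        if (x >= DIM_x || y >= DIM_y || dx >= psize)
            return;

        // Many threads legitimately share x == 3, but each (x, y, dx) triple is unique
        if (x == 3)
            printf("x=%d y=%d dx=%d\n", x, y, dx);
    }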

    If you are getting incorrect results, the most likely cause is that the device memory was not initialized before the kernel ran, for example because the buffer was allocated but never zeroed (see the sketch below). Otherwise I fail to see how the code you have shown could produce incorrect results.
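
    For example, a minimal sketch of an explicit zero-initialization before the launch (the names d_array and N are mine; grid, block and kernel are as in the program above). In the PyCUDA case the equivalent would be something like allocating the array with gpuarray.zeros rather than using an uninitialized allocation:

    int N = DIM_x * DIM_y * psize;
    int* d_array = nullptr;
    cudaMalloc(&d_array, N * sizeof(int));
    cudaMemset(d_array, 0, N * sizeof(int));   // without this the buffer holds garbage
    kernel<<<grid, block>>>(d_array, DIM_x, DIM_y, psize);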

    In general, performing a large number of atomic memory transactions, as your code does, is not the optimal way to perform computation on the GPU. Avoiding the atomics would generally require a priori information about the structure of the problem (such as a graph decomposition or a precise description of its write patterns); one such restructuring for this specific pattern is sketched below.
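
    For this particular increment pattern, one way to avoid the atomics entirely is to invert the scatter into a gather: give each thread ownership of exactly one output element and let it count every (x, y, dx) triple from the original loops that maps onto that element. Below is a minimal, unoptimised sketch of the idea (the kernel name and launch shape are mine, not part of your code); each thread does DIM_y * psize iterations, so it only illustrates removing the atomics, it is not a tuned implementation.

    __global__
    void kernel_gather(int* array, int DIM_x, int DIM_y, int psize, int N)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output element
        if (j >= N) return;

        int count = 0;
        for (int y = 0; y < DIM_y; y++) {
            for (int dx = 0; dx < psize; dx++) {
                int x = j - y * DIM_x - dx;              // invert j = y*DIM_x + x + dx
                if (x >= 0 && x < DIM_x)
                    count++;
            }
        }
        array[j] += count;                               // this thread owns array[j]: no atomics
    }

    launched with a 1D grid covering all N = DIM_x * DIM_y * psize elements, e.g. kernel_gather<<<(N + 255) / 256, 256>>>(d_array, DIM_x, DIM_y, psize, N).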