This is a very beginner-oriented question. I have been working with regular Python threads and C threads, and I learned that I can create threads that run a specific function, and that those threads use semaphores and other sync primitives.
But now I am trying to learn CUDA using Numba's Python-based compiler. I have written the following code.
    from numba import cuda
    import numpy as np

    @cuda.jit
    def image_saturate(data):
        pos_x, pos_y = cuda.grid(2)
        if (pos_x, pos_y) <= data.shape:
            data[pos_x, pos_y] = 1

    if __name__ == "__main__":
        image_quality = (128, 72)
        image = np.zeros(image_quality)
        thread_size = 32
        block_size = image_quality
        image_saturate[block_size, thread_size](image)
        print(image)
But the thing that feels weird to me is that I can change thread_size as I want and the result is the same - the output is all ones, as expected. But the moment I change block_size, weird things start happening and only that size of the original matrix gets filled with ones - so it's only a partial filling.
From this I understand that cuda.grid(2) returns the block coordinates. But shouldn't I be able to get the actual thread coordinates and the block coordinates as well?
I am terribly new at this and I can't find any resources to learn online. It would be great if anyone could answer my question and also provide resources for learning CUDA with Numba.
From this I understand that cuda.grid(2) returns the block coordinates.
That's not the case. That statement returns a fully-qualified 2D thread index. The range of the returned values will extend to the product of the block coordinates limit and the thread coordinates limit.
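To see this, here is a minimal sketch (the kernel name, array shape, and launch dimensions are my own illustrative choices) showing what cuda.grid(2) computes for each thread:

    from numba import cuda
    import numpy as np

    @cuda.jit
    def check_grid(out):
        gx, gy = cuda.grid(2)
        # cuda.grid(2) is shorthand for the per-axis global thread index:
        x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        if x < out.shape[0] and y < out.shape[1]:
            if gx == x and gy == y:
                out[x, y] = 1

    out = np.zeros((8, 8), dtype=np.uint8)
    check_grid[(2, 2), (4, 4)](out)  # 2x2 blocks of 4x4 threads = 8x8 threads total
    print(out.all())                 # True: every thread saw gx == x and gy == y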
In CUDA, the grid dimension parameter for a kernel launch (the one you are calling block_size) specifies the grid dimension, in terms of the number of blocks in each direction. The block dimension parameter for a kernel launch (the one you are calling thread_size) specifies the size of each block in the grid, in terms of the number of threads per block.
Therefore the total number of threads launched is equal to the product of the grid dimension(s) and the block dimension(s): the overall total is the product of all those quantities across all dimensions, and the total per dimension is the product of the grid dimension and the block dimension in that direction.
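As a plain-Python sanity check using the question's launch configuration (a scalar thread_size of 32 behaves like a (32, 1) block shape here):

    grid_dim = (128, 72)   # blocks per grid: the question's block_size
    block_dim = (32, 1)    # threads per block: the question's thread_size
    threads_x = grid_dim[0] * block_dim[0]  # 128 * 32 = 4096 threads in x
    threads_y = grid_dim[1] * block_dim[1]  # 72 * 1 = 72 threads in y
    print(threads_x, threads_y, threads_x * threads_y)  # 4096 72 294912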
So you have a questionable design choice: you have an image size, and you are setting the grid dimension equal to the image size. That could only be sensible if you had just one thread per block. As you will discover by looking at any proper numba CUDA code (such as the one here), the typical approach is to divide the total desired dimension (in this case, the image dimensions) by the number of threads per block, to get the grid dimension.
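For example, sizing the grid that way for the question's 128x72 image might look like this (the 2D block shape is my own choice, matching the question's 32 threads):

    import math

    image_x, image_y = 128, 72
    threads_per_block = (32, 1)  # 32 threads per block, arranged as a 2D shape
    blocks_per_grid = (math.ceil(image_x / threads_per_block[0]),   # 4 blocks in x
                       math.ceil(image_y / threads_per_block[1]))  # 72 blocks in y
    # 4 * 32 = 128 total threads in x, 72 * 1 = 72 total threads in y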
When we do so, the cuda.grid() statement in your kernel code will return a tuple with sensible ranges. In your case, it would return tuples to threads that correctly go from 0..127 in x and 0..71 in y. The problem you have at the moment is that the cuda.grid() statement can return tuples that range from 0..((128*32)-1) in x, and that is unnecessary.
Of course, the goal of your if statement is to prevent out-of-bounds indexing; however, the test of <= does not look right to me. This is the classic computer-science off-by-one error: the valid indices in each dimension run from 0 to one less than shape, so threads whose indices happen to match the limit returned by shape should be excluded.
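Concretely, with the question's image shape:

    shape = (128, 72)
    pos = (128, 70)           # one past the last valid x index (0..127)
    print(pos <= shape)       # True: the <= test lets this thread through
    print(pos[0] < shape[0])  # False: a strict < test correctly excludes it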
But the moment I change block_size, weird things start happening and only that size of the original matrix gets filled with ones - so it's only a partial filling.
It's really not clear what your expectations are here. Your kernel design is such that each thread populates (at most) one output point, so sensible grid sizing is to match the grid (the total threads in x and the total threads in y) to the image dimensions. If you follow the above recommendations for grid sizing and then set your grid size to something less than your image size, I would expect portions of your output image not to be populated. Don't do that. Or, if you must do that, employ a grid-stride loop kernel design (a sketch appears at the end of this answer).
Having said all that, the following is how I would rewrite your code:
    from numba import cuda
    import numpy as np

    @cuda.jit
    def image_saturate(data):
        pos_x, pos_y = cuda.grid(2)
        if pos_x < data.shape[0] and pos_y < data.shape[1]:
            data[pos_x, pos_y] = 1

    if __name__ == "__main__":
        image_x = 128
        image_y = 72
        image_quality = (image_x, image_y)
        image = np.zeros(image_quality)
        thread_x = 32
        thread_y = 1
        thread_size = (thread_x, thread_y)
        block_size = ((image_x//thread_x) + 1, (image_y//thread_y) + 1)  # "lazy" round-up
        image_saturate[block_size, thread_size](image)
        print(image)
It appears to run correctly for me. If you now suggest that what you want to do is to arbitrarily modify the block_size variable, e.g.:

    block_size = (5, 5)
and make no other changes, and expect the output image to be fully populated, I would say that is not a sensible expectation. I have no idea how that could be sensible, so I will just say that CUDA doesn't work that way. If you wish to "decouple" the data size from the grid size, the canonical way to do it is the grid stride loop as already discussed.
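For completeness, here is a minimal sketch of a grid-stride version of your kernel (cuda.gridsize(2) gives the total number of launched threads in each dimension); this version does tolerate an arbitrary launch shape such as (5, 5) blocks:

    from numba import cuda
    import numpy as np

    @cuda.jit
    def image_saturate_gs(data):
        start_x, start_y = cuda.grid(2)
        stride_x, stride_y = cuda.gridsize(2)
        # each thread starts at its global index and strides by the total
        # thread count, so any data size is covered by any launch size
        for i in range(start_x, data.shape[0], stride_x):
            for j in range(start_y, data.shape[1], stride_y):
                data[i, j] = 1

    if __name__ == "__main__":
        image = np.zeros((128, 72))
        image_saturate_gs[(5, 5), (8, 4)](image)  # grid size decoupled from image size
        print(image.all())                        # True: fully populated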
I've also removed the tuple comparison. I don't think it is really germane here. If you still want to use the tuple comparison, it should work exactly as you would expect based on Python; there isn't anything CUDA-specific about it.
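If you do keep it, note that Python tuple comparison is lexicographic, element by element:

    print((3, 5) <= (128, 72))    # True
    print((128, 0) <= (128, 72))  # True: 128 ties with 128, then 0 <= 72
    print((129, 0) <= (128, 72))  # False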