
How to relate kernel input data structure in CUDA kernel function with parameter input in pycuda


I am writing a CUDA kernel in PyCUDA to convert an RGBA image to a grayscale image. Here is the PyCUDA code:

import numpy as np
import matplotlib.pyplot as plt
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
kernel = SourceModule("""
#include <stdio.h>
__global__ void rgba_to_greyscale(const uchar4* const rgbaImage,
                   unsigned char* const greyImage,
                   int numRows, int numCols)
{
  int y = threadIdx.y+ blockIdx.y* blockDim.y;
  int x = threadIdx.x+ blockIdx.x* blockDim.x;
  // x indexes columns and y indexes rows (row-major layout, as in NumPy)
  if (y < numRows && x < numCols) {
    int index = numCols*y + x;
    uchar4 color = rgbaImage[index];
    unsigned char grey = (unsigned char)(0.299f*color.x+ 0.587f*color.y + 
    0.114f*color.z);
    greyImage[index] = grey;
 }
}
""")

However, the problem is how to relate uchar4* to a numpy array. I know I can modify my kernel function to accept int* or float* and make it work, but I wonder how to make the kernel function above work in PyCUDA as written.

Below is the host code.

def gpu_rgb2gray(image):
    shape = image.shape
    n_rows, n_cols, _ = np.array(shape, dtype=np.int32)
    image_gray = np.empty((n_rows, n_cols), dtype=np.uint8)
    ## HERE is the confusing part: how to rearrange image to match uchar4*?
    image = image.reshape(1, -1, 4)
    # Get kernel function
    rgba2gray = kernel.get_function("rgba_to_greyscale")
    # Define block, grid and compute
    blockDim = (32, 32, 1) # 1024 threads in total
    dx, mx = divmod(shape[1], blockDim[0])
    dy, my = divmod(shape[0], blockDim[1])
    gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
    # Kernel function
    # HERE doesn't work because of mismatch
    rgba2gray (
        cuda.In(image), cuda.Out(image_gray), n_rows, n_cols,
        block=blockDim, grid=gridDim)
    return image_gray

Anyone have any ideas? Thanks!


Solution

  • The gpuarray class has native support for CUDA's built-in vector types (including uchar4).

    So you can create a gpuarray instance with the correct dtype for the kernel, copy the host image to that gpuarray through the buffer interface, and then pass the gpuarray as the kernel input argument. As an example (and if I understood your code correctly), something like this should probably work:

    import pycuda.gpuarray as gpuarray
    
    ....
    
    def gpu_rgb2gray(image):
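        # Works for a (rows, cols, 4) uint8 image or a (rows, cols) uint32 image
        shape = image.shape[:2]
        image_rgb = gpuarray.empty(shape, dtype=gpuarray.vec.uchar4)
        # Copy the raw bytes; the host array must be C-contiguous
        cuda.memcpy_htod(image_rgb.gpudata, np.ascontiguousarray(image))
        image_gray = gpuarray.empty(shape, dtype=np.uint8)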
    
        # Get kernel function
        rgba2gray = kernel.get_function("rgba_to_greyscale")
        # Define block, grid and compute
        blockDim = (32, 32, 1) # 1024 threads in total
        dx, mx = divmod(shape[1], blockDim[0])
        dy, my = divmod(shape[0], blockDim[1])
        gridDim = ((dx + (mx>0)), (dy + (my>0)), 1)
        rgba2gray(image_rgb, image_gray, np.int32(shape[0]), np.int32(shape[1]),
                  block=blockDim, grid=gridDim)
    
        # Copy the result back to the host, then upcast to 32-bit integers
        img_gray = image_gray.get().astype(np.int32)
    
        return img_gray
    

    This takes an image with 4 bytes per pixel (for example, 32-bit unsigned integers), copies it to an array of uchar4 on the GPU, runs the kernel, and then copies the resulting array of unsigned char back to the host, where it is upcast to integers.
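
To see how a numpy image lines up with uchar4 without needing a GPU, here is a minimal host-side sketch. It assumes the image is a (rows, cols, 4) uint8 array. The structured dtype is defined by hand and is only intended to mirror the 4-byte x, y, z, w layout of uchar4, and the helper names as_uchar4 and cpu_rgba_to_grey are hypothetical, not part of PyCUDA. The second function applies the same weights and truncation as the kernel, so it may be useful as a reference when checking the GPU output, although the GPU result could differ by one grey level because of floating-point rounding.

```python
import numpy as np

# Hand-written structured dtype with the same 4-byte layout as CUDA's uchar4.
# Assumption: fields named x, y, z, w, one unsigned byte each.
uchar4 = np.dtype([("x", np.uint8), ("y", np.uint8),
                   ("z", np.uint8), ("w", np.uint8)])


def as_uchar4(image):
    """View a (rows, cols, 4) uint8 image as a (rows, cols) array of uchar4 records."""
    image = np.ascontiguousarray(image, dtype=np.uint8)
    # Each group of 4 bytes along the last axis becomes one record.
    return image.view(uchar4).reshape(image.shape[:2])


def cpu_rgba_to_grey(image):
    """Host-side reference using the same weights and truncation as the kernel."""
    px = as_uchar4(image)
    r = px["x"].astype(np.float32)
    g = px["y"].astype(np.float32)
    b = px["z"].astype(np.float32)
    grey = np.float32(0.299) * r + np.float32(0.587) * g + np.float32(0.114) * b
    # Casting float to uint8 truncates toward zero, like the (unsigned char) cast.
    return grey.astype(np.uint8)
```

Because the view shares memory with the original array, no pixel data is rearranged: one uchar4 record is simply the four consecutive bytes of one pixel, which is why a raw byte copy to the device is enough.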