I'm writing a Python application that processes a lot of images. Execution speed matters, so I'm trying to minimize the run time by writing CuPy kernels.
For the sake of simplicity, assume that I have the CuPy raw kernel below.
import cupy as cp

add_kernel = cp.RawKernel(r'''
extern "C" __global__
void add_one(float* dimg, float* y) {
    int j = threadIdx.x;
    int i = blockIdx.x;
    int k = blockDim.x;
    int tid = k*i+j;
    y[tid] = dimg[tid] + 1;
}
''', 'add_one')

if __name__ == '__main__':
    h, w = 192, 256
    dimg_cp = cp.zeros(shape=(h, w), dtype=cp.float32)
    y = cp.zeros(shape=(h, w), dtype=cp.float32)
    add_kernel((h,), (w,), (dimg_cp, y))
    print(y)
Here, 'add_kernel' simply adds one to every element of the input matrix and writes the result to the output matrix y. It works well, but I believe the code can be further optimized in terms of execution speed.
According to the link, when the kernel is called for the first time (i.e. not cached), there will be an overhead for compilation.
I want to avoid this compilation time. So I want to ask if there is a way of compiling cp.RawKernel prior to calling the kernel for the first time?
Thanks in advance.
There is currently no explicit way to precompile the kernel without calling it. One easy workaround is to call it once with a small input, for example at startup. Note that the compiled kernel is also cached to a file, so the compilation overhead is paid only the first time the script runs in a given environment; later runs load the cached binary instead of recompiling.
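As a rough sketch of the warm-up idea, the snippet below launches the kernel once on one-element dummy arrays before any real work is done. The helper name `warm_up` and its `make_args` parameter are made up for this illustration and are not part of CuPy; the kernel is the one from the question, condensed.

```python
def warm_up(kernel, make_args, grid=(1,), block=(1,)):
    """Launch `kernel` once on tiny inputs so any compilation happens now.

    `make_args` is a hypothetical callback returning the argument tuple;
    the arguments are returned so the caller can inspect them if needed.
    """
    args = make_args()
    kernel(grid, block, args)
    return args


def main():
    # Requires CuPy and a CUDA-capable GPU, so the import is kept local.
    import cupy as cp

    add_kernel = cp.RawKernel(r'''
    extern "C" __global__
    void add_one(const float* dimg, float* y) {
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        y[tid] = dimg[tid] + 1;
    }
    ''', 'add_one')

    # One block of one thread only touches index 0 of the dummy arrays.
    warm_up(
        add_kernel,
        lambda: (cp.zeros(1, dtype=cp.float32), cp.zeros(1, dtype=cp.float32)),
    )

    # Later launches on real images reuse the already-compiled kernel.
    h, w = 192, 256
    dimg_cp = cp.zeros((h, w), dtype=cp.float32)
    y = cp.empty_like(dimg_cp)
    add_kernel((h,), (w,), (dimg_cp, y))
    print(y)
```

Calling `main()` on a machine with CuPy and a CUDA device runs the warm-up followed by the real launch. The on-disk cache mentioned above lives under `~/.cupy/kernel_cache` by default and can be relocated with the `CUPY_CACHE_DIR` environment variable, so keeping that directory between runs avoids recompiling.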