Recently I was asked to maintain an old image processing project (5 years old) at my company, and it uses OpenCL.
There is a piece of code that works like this:
**if (oneKernelFlag == true)
    launch a gamma correction kernel on the whole image
else
    break the image into grids (e.g. 2*2)
    for loop (....) // iterate over each grid
        launch the same gamma correction kernel on each grid**
Similar logic is used to apply kernels in a few other functions. The oneKernelFlag is hardcoded, and the project is built separately for each hardware product.
I noticed that execution is much faster when we launch a single kernel (oneKernelFlag == true) than when we launch multiple kernels: almost a 30% reduction in run time.
Now I am confused: what is the point of launching the same kernel multiple times on smaller problem spaces? When is this useful?
Please help
The original developer and documentation are unavailable, and I could not find concrete details online.
Launching an image-processing kernel several times on small image regions used to have caching benefits. On older GPUs and iGPUs, only a small part of the image fits in the L2/L3 cache, so when a kernel reads the same pixels repeatedly, those reads can be served from the L2/L3 cache instead of slower RAM/VRAM.
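As rough arithmetic for the cache argument, here is a minimal sketch. The image size (1920x1080), the RGBA8 pixel format, and the 2 MiB cache are all assumed numbers chosen for illustration; none of them come from the question, and real cache sizes vary widely between devices.

```python
def footprint_bytes(width, height, bytes_per_pixel=4):
    """Bytes needed to hold a width x height region (RGBA8 by default)."""
    return width * height * bytes_per_pixel


CACHE_BYTES = 2 * 1024 * 1024  # assumed cache size: 2 MiB (hypothetical)
WIDTH, HEIGHT = 1920, 1080     # assumed image size (hypothetical)
GRID = 2                       # 2*2 grid, as in the question

full_image = footprint_bytes(WIDTH, HEIGHT)
one_tile = footprint_bytes(WIDTH // GRID, HEIGHT // GRID)

# Compare each working set against the assumed cache size.
print("full image:", full_image, "bytes, fits:", full_image <= CACHE_BYTES)
print("one tile:  ", one_tile, "bytes, fits:", one_tile <= CACHE_BYTES)
```

With these assumed numbers the whole image is about four times larger than the cache, while a single tile of the 2*2 grid just fits. This only helps a kernel that reads the same pixels more than once; a kernel that touches each pixel a single time gains little from it.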
However, every small-region kernel dispatch adds its own latency/overhead. Modern GPUs have much larger L2/L3 caches that can often hold an entire image, and the GPU scheduler is better at exploiting the cache, so a single dispatch over the entire image is faster.
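The trade-off can be sketched on the CPU without any OpenCL dependency. In the sketch below, `launch` is a hypothetical stand-in for one kernel dispatch over a sub-region (in OpenCL the region would typically be expressed through the `global_work_offset` and `global_work_size` arguments of `clEnqueueNDRangeKernel`). The overhead and per-pixel costs in `estimated_time_us` are made-up illustrative numbers, not measurements.

```python
def gamma_correct(value, gamma=2.2):
    """Gamma-correct one 8-bit channel value."""
    return round(255 * (value / 255) ** (1 / gamma))


def launch(src, dst, width, x0, y0, w, h):
    """Hypothetical stand-in for one kernel dispatch over region (x0, y0, w, h)."""
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            dst[y * width + x] = gamma_correct(src[y * width + x])
    return 1  # one dispatch issued


def run(src, width, height, grid):
    """Process the image as grid x grid tiles; grid=1 is a single launch."""
    dst = [0] * (width * height)
    launches = 0
    tile_w, tile_h = width // grid, height // grid  # assumes exact division
    for gy in range(grid):
        for gx in range(grid):
            launches += launch(src, dst, width,
                               gx * tile_w, gy * tile_h, tile_w, tile_h)
    return dst, launches


def estimated_time_us(launches, pixels, overhead_us=20.0, per_pixel_us=0.001):
    """Toy cost model: fixed cost per dispatch plus cost per pixel (assumed values)."""
    return launches * overhead_us + pixels * per_pixel_us


WIDTH, HEIGHT = 8, 8
src = [(i * 37) % 256 for i in range(WIDTH * HEIGHT)]  # synthetic test data

single, n_single = run(src, WIDTH, HEIGHT, grid=1)
tiled, n_tiled = run(src, WIDTH, HEIGHT, grid=2)

print("same output:", single == tiled)
print("launches:", n_single, "vs", n_tiled)
print("estimated time (us):",
      estimated_time_us(n_single, WIDTH * HEIGHT), "vs",
      estimated_time_us(n_tiled, WIDTH * HEIGHT))
```

Both paths produce the same pixels; the tiled path simply pays the fixed per-dispatch cost four times. Tiling only wins when the cache benefit per tile outweighs that extra overhead, which is why profiling both settings of oneKernelFlag on each target device is the reliable way to decide.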