Launching a single kernel on the whole problem space vs. launching the same kernel multiple times on smaller problem spaces

Recently I was asked to maintain an old image-processing project (5 years old) at my company, and it uses OpenCL.

There is a piece of code that works like this:

    if (oneKernelFlag == true)
        launch a gamma correction kernel on the whole image
    else
        break the image into grids (e.g. 2*2)
        for loop (....) // iterate over each grid
            launch the same gamma correction kernel on each grid
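
For concreteness, the control flow above can be emulated on the CPU. This is only a sketch in Python, not the project's OpenCL code: `launch_gamma` is a hypothetical stand-in for one kernel launch over a rectangle (in OpenCL this would roughly correspond to the global work offset and global work size passed to `clEnqueueNDRangeKernel`), and the image contents, grid size and gamma value are made up for illustration.

```python
def launch_gamma(src, dst, x0, y0, w, h, gamma):
    """Hypothetical stand-in for one kernel launch over a w x h rectangle at (x0, y0)."""
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            dst[y][x] = round(255 * (src[y][x] / 255) ** (1 / gamma))


def tiles(width, height, nx, ny):
    """Yield (x0, y0, w, h) rectangles that cover the image exactly once."""
    for j in range(ny):
        y0, y1 = j * height // ny, (j + 1) * height // ny
        for i in range(nx):
            x0, x1 = i * width // nx, (i + 1) * width // nx
            yield x0, y0, x1 - x0, y1 - y0


def run(src, one_kernel_flag, grid=(2, 2), gamma=2.2):
    """Apply gamma correction with one launch or one launch per grid tile."""
    height, width = len(src), len(src[0])
    dst = [[0] * width for _ in range(height)]
    launches = 0
    if one_kernel_flag:
        launch_gamma(src, dst, 0, 0, width, height, gamma)
        launches = 1
    else:
        for x0, y0, w, h in tiles(width, height, *grid):
            launch_gamma(src, dst, x0, y0, w, h, gamma)
            launches += 1
    return dst, launches


# Small synthetic 7x5 image (odd sizes exercise uneven tiles).
image = [[(x * 37 + y * 11) % 256 for x in range(7)] for y in range(5)]
whole, n_whole = run(image, one_kernel_flag=True)
tiled, n_tiled = run(image, one_kernel_flag=False)
```

Both paths write the same output pixels; the only difference is how many launches are issued to produce them.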

Similar logic is used to apply kernels in a few other functions. The oneKernelFlag is hardcoded, and the project is built separately for each hardware product.

I noticed that execution is much faster when we launch a single kernel (oneKernelFlag == true) than with multiple kernel launches: almost a 30% reduction in execution time.

Now I am confused: what is the point of launching the same kernel multiple times on smaller problem spaces? When is this useful?

Please help

The original developer and documentation are unavailable, and I could not find concrete details online.

Solution

Launching an image-processing kernel multiple times on small image regions used to have caching benefits. On older GPUs and iGPUs, only a small part of the image fits in the L2/L3 cache, so when a kernel accesses the same pixels repeatedly, those accesses may be served from the faster L2/L3 cache instead of slower RAM/VRAM.
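
A rough way to reason about this is to compare the bytes one launch touches with the cache size. The sketch below is a back-of-the-envelope Python helper, not a measurement. The 4 bytes per pixel (RGBA8), the separate input and output buffers, and both cache sizes are assumptions; on a real device you would substitute the actual pixel format and the value reported by `clGetDeviceInfo` for `CL_DEVICE_GLOBAL_MEM_CACHE_SIZE`.

```python
def working_set_bytes(width, height, bytes_per_pixel=4, buffers=2):
    """Bytes touched by one launch over a width x height region (assumed layout)."""
    return width * height * bytes_per_pixel * buffers


def fits_in_cache(width, height, cache_bytes, **layout):
    """True if the region's working set is no larger than the cache."""
    return working_set_bytes(width, height, **layout) <= cache_bytes


def min_grid_to_fit(width, height, cache_bytes, max_split=64, **layout):
    """Smallest n such that one tile of an n x n grid fits in cache, else None."""
    for n in range(1, max_split + 1):
        tile_w = -(-width // n)   # ceiling division
        tile_h = -(-height // n)
        if fits_in_cache(tile_w, tile_h, cache_bytes, **layout):
            return n
    return None


# Hypothetical numbers: a 1920x1080 RGBA8 image and two made-up cache sizes.
SMALL_CACHE = 512 * 1024          # assumed "older device" cache
LARGE_CACHE = 32 * 1024 * 1024    # assumed "newer device" cache

full_image_bytes = working_set_bytes(1920, 1080)
grid_for_small = min_grid_to_fit(1920, 1080, SMALL_CACHE)
grid_for_large = min_grid_to_fit(1920, 1080, LARGE_CACHE)
```

With these assumed sizes, the small cache needs the image split into tiles while the large one holds it whole. Note that fitting in cache only pays off if the kernel rereads pixels (for example, a filter that samples neighbours); a purely per-pixel operation such as gamma correction reads each pixel once, so there is little reuse for the cache to exploit.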

However, each small-region kernel dispatch adds its own latency/overhead. Modern GPUs have much larger L2/L3 caches that can often hold an entire image, and the GPU scheduler is more cache-aware, so a single dispatch over the entire image is then typically faster.
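
A simple cost model illustrates the overhead side of this trade-off. It is a sketch under stated assumptions, not a benchmark: each dispatch is modelled as a fixed overhead plus compute time proportional to the pixel count, any cache benefit from tiling is ignored, and the overhead and per-pixel figures are placeholders that you would replace with values measured on your own device (for example, via OpenCL event profiling).

```python
def total_time_us(pixels, launches, launch_overhead_us, ns_per_pixel):
    """Fixed cost per dispatch plus compute time proportional to pixel count."""
    compute_us = pixels * ns_per_pixel / 1000.0
    return launches * launch_overhead_us + compute_us


def tiling_penalty(pixels, launches, launch_overhead_us, ns_per_pixel):
    """Relative slowdown of `launches` dispatches versus a single dispatch."""
    single = total_time_us(pixels, 1, launch_overhead_us, ns_per_pixel)
    tiled = total_time_us(pixels, launches, launch_overhead_us, ns_per_pixel)
    return (tiled - single) / single


# Placeholder inputs, NOT measurements.
PIXELS = 1920 * 1080
OVERHEAD_US = 20.0    # assumed per-dispatch overhead
NS_PER_PIXEL = 0.1    # assumed compute cost per pixel

single = total_time_us(PIXELS, 1, OVERHEAD_US, NS_PER_PIXEL)
tiled = total_time_us(PIXELS, 4, OVERHEAD_US, NS_PER_PIXEL)
penalty = tiling_penalty(PIXELS, 4, OVERHEAD_US, NS_PER_PIXEL)
print(f"single: {single:.1f} us, 2x2 tiled: {tiled:.1f} us, penalty: {penalty:.0%}")
```

In this model the tiled path can never win, because the compute term is identical and only the overhead term grows with the number of launches. Tiling pays off only when the cache benefit it buys outweighs that added overhead.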