OpenCL long kernel execution time

I am implementing some image processing with OpenCL on a GPU. The host program launches the kernel 4 times, and the total kernel time is about 13 ms according to the AMD profiler, which I think is a good result. But if I measure the kernel execution time on the host with QueryPerformanceCounter, it shows about 26 ms. The clEnqueueNDRangeKernel call itself takes less than 1 ms. Where do the other 13 ms (26 - 13) go, and how can I fix it? I run it on GPU 1: AMD Radeon HD 6900 Series, using AMD SDK 3.0. If I launch the kernel once and add a loop of 4 iterations inside the kernel, the result is the same.

Solution

clEnqueueNDRangeKernel is, as the name says, an "enqueue" call: it only adds work to a command queue. That does not mean the work is completed when the call returns; in fact, it may not even have started. The API has probably just packed the work into a tidy structure of commands and added it to the queue (submit phase). That is why the call itself returns in under 1 ms.

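You have to measure the kernel execution with event profiling: pass a cl_event to the enqueue call on a command queue created with profiling enabled (CL_QUEUE_PROFILING_ENABLE), then read the timestamps from that event. That is the real execution time on the device.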
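A minimal sketch of that approach, in case it helps. The struct and helper names (event_times, ns_to_ms, exec_ms, wait_ms, read_event_times) and the USE_OPENCL build switch are my own inventions for illustration; only the cl* calls and CL_* constants come from the OpenCL API. The arithmetic is kept separate from the query so it can be checked without a device.

```c
#include <assert.h>
#include <stdint.h>

/* The four per-event timestamps OpenCL reports, in device nanoseconds. */
typedef struct {
    uint64_t queued;  /* command was placed in the command queue by the host */
    uint64_t submit;  /* command was submitted to the device */
    uint64_t start;   /* kernel began executing */
    uint64_t end;     /* kernel finished executing */
} event_times;

/* Length of a nanosecond interval, in milliseconds. */
double ns_to_ms(uint64_t from, uint64_t to) {
    assert(to >= from);
    return (double)(to - from) / 1.0e6;
}

/* Time the kernel actually spent running on the device. */
double exec_ms(const event_times *t) { return ns_to_ms(t->start, t->end); }

/* Time the command spent waiting before it started running. */
double wait_ms(const event_times *t) { return ns_to_ms(t->queued, t->start); }

#ifdef USE_OPENCL  /* hypothetical switch: build with -DUSE_OPENCL -lOpenCL */
#include <CL/cl.h>

/* Wait for `ev` and read its timestamps. The event must come from a queue
   created with CL_QUEUE_PROFILING_ENABLE, otherwise the query fails with
   CL_PROFILING_INFO_NOT_AVAILABLE. */
cl_int read_event_times(cl_event ev, event_times *t) {
    const cl_profiling_info what[4] = {
        CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT,
        CL_PROFILING_COMMAND_START,  CL_PROFILING_COMMAND_END };
    cl_ulong v[4];
    cl_int err = clWaitForEvents(1, &ev);
    for (int i = 0; i < 4 && err == CL_SUCCESS; ++i)
        err = clGetEventProfilingInfo(ev, what[i], sizeof v[i], &v[i], NULL);
    if (err == CL_SUCCESS) {
        t->queued = v[0]; t->submit = v[1]; t->start = v[2]; t->end = v[3];
    }
    return err;
}
#endif
```

The event itself is obtained through the last argument of clEnqueueNDRangeKernel. Comparing exec_ms with wait_ms for each of the 4 launches should show how much of the host-side figure is waiting rather than computing.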

Alternatively, you can measure the total "roundtrip" time by timing from the "enqueue" call until clFinish returns. But that includes all the overheads that are usually hidden in a pipelined scenario, so the first approach is normally preferred.
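For the roundtrip variant, a rough sketch could look like this. I use C11 timespec_get as a portable stand-in for QueryPerformanceCounter (it is a wall clock, not a monotonic one, so treat it as an approximation). The names elapsed_ms, time_blocking_ms, launch and enqueue_and_finish, the 2-D work size, and the USE_OPENCL switch are assumptions made for the example, not anything from the question.

```c
#include <time.h>

/* Difference between two host timestamps, in milliseconds. */
double elapsed_ms(struct timespec a, struct timespec b) {
    return (double)(b.tv_sec - a.tv_sec) * 1.0e3
         + (double)(b.tv_nsec - a.tv_nsec) / 1.0e6;
}

/* Time a host-side call. `run` has to block until the work is really done,
   otherwise only the cost of submitting it is measured. */
double time_blocking_ms(void (*run)(void *), void *ctx) {
    struct timespec t0, t1;
    timespec_get(&t0, TIME_UTC);
    run(ctx);
    timespec_get(&t1, TIME_UTC);
    return elapsed_ms(t0, t1);
}

#ifdef USE_OPENCL  /* hypothetical switch: build with -DUSE_OPENCL -lOpenCL */
#include <CL/cl.h>

/* Hypothetical bundle of launch parameters, for illustration only. */
typedef struct {
    cl_command_queue queue;
    cl_kernel kernel;
    size_t global_size[2];
} launch;

/* Enqueue the kernel, then block until the device has finished it. */
void enqueue_and_finish(void *p) {
    launch *l = p;
    clEnqueueNDRangeKernel(l->queue, l->kernel, 2, NULL,
                           l->global_size, NULL, 0, NULL, NULL);
    clFinish(l->queue);
}
#endif
```

With OpenCL enabled, time_blocking_ms(enqueue_and_finish, &l) would give the roundtrip figure for one launch. Dropping the clFinish call should bring the measurement down to roughly the sub-millisecond enqueue cost mentioned in the question.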