Tags: time, opencl, jocl

OpenCL Start-End profiling time takes longer than actual duration time


I wrote an OpenCL program in Java (JOCL) and I execute my kernels like this:

 // Enqueue one kernel per GPU; each kernel waits on the shared userEvent
 for (int i = 0; i < events.length; i++) {
     clEnqueueNDRangeKernel(commandQueues[i], kernel[i], 1, null,
            global_work_size, local_work_size, 1, new cl_event[]{userEvent}, events[i]);
     clFlush(commandQueues[i]);
 }

 long before = System.nanoTime();

 // Set userEvent to CL_COMPLETE so all kernels can start executing
 clSetUserEventStatus(userEvent, CL_COMPLETE);

 // Wait until the work is finished on all command queues
 clWaitForEvents(events.length, events);

 long after = System.nanoTime();

 float totalDurationMs = (after - before) / 1e6f;

 ...profiling each event with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END...

The userEvent ensures that the kernels start at the same time. Source: [Reima's answer] to "How do I know if the kernels are executing concurrently?".
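
The profiling step elided above is essentially the standard clGetEventProfilingInfo query, roughly like this (variable names here are just for illustration, and it assumes the command queues were created with CL_QUEUE_PROFILING_ENABLE):

 long[] start = new long[1];
 long[] end = new long[1];
 // GPU timestamps are reported in nanoseconds
 clGetEventProfilingInfo(events[i], CL_PROFILING_COMMAND_START,
        Sizeof.cl_ulong, Pointer.to(start), null);
 clGetEventProfilingInfo(events[i], CL_PROFILING_COMMAND_END,
        Sizeof.cl_ulong, Pointer.to(end), null);
 float deviceDurationMs = (end[0] - start[0]) / 1e6f;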

I'm getting this result on one system with two Tesla K20M GPUs in it:

 Total duration: 37.800076 ms
 Duration on device 1 of 2: 38.037186 ms
 Duration on device 2 of 2: 37.85744 ms

Can someone explain why the start-end profiling time is longer than the total duration time?

Thank you.


Solution

  • Read this: Timer Accuracy.

    You shouldn't trust those system calls to give you the time; they typically have about ±1 ms accuracy, unless you dig deep and read the CPU cycle counters (but that's difficult). The GPU timing, however, is quite precise (at the nanosecond level), so use that instead.

    EDIT: If you want to test it (just for fun): enqueue the kernel 1000 times, sum the profiled GPU times of each execution, and compare that to the system time, as sketched below. In that case the summed GPU time should never come out higher, since the system timer's inaccuracy is far smaller than the total execution time (about 38 seconds).
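
    A rough sketch of that test in JOCL (the kernel, queue, and variable names are placeholders, and it assumes the queue was created with CL_QUEUE_PROFILING_ENABLE):

        float gpuTotalMs = 0;
        long before = System.nanoTime();
        for (int run = 0; run < 1000; run++) {
            // Enqueue one execution and wait for it to finish
            cl_event evt = new cl_event();
            clEnqueueNDRangeKernel(commandQueue, kernel, 1, null,
                    global_work_size, local_work_size, 0, null, evt);
            clWaitForEvents(1, new cl_event[]{evt});

            // Accumulate the GPU-measured duration of this execution
            long[] start = new long[1];
            long[] end = new long[1];
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                    Sizeof.cl_ulong, Pointer.to(start), null);
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                    Sizeof.cl_ulong, Pointer.to(end), null);
            gpuTotalMs += (end[0] - start[0]) / 1e6f;   // ns -> ms
        }
        long after = System.nanoTime();
        float cpuTotalMs = (after - before) / 1e6f;
        // Over roughly 38 seconds of total work, the ~1 ms system-timer error
        // is negligible, so gpuTotalMs should no longer exceed cpuTotalMs.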