opencl overhead pyopencl overhead-minimization

Measuring and minimization of OpenCL overhead

I have a pyopencl program that make a long calculation (~3-5 hours per run). I have several kernels started one by one in cycle. So I have something like this:

prepare_kernels_and_data()

for i in range(big_number): # in my case big_number is 400000
  load_data_to_device(i)    # ~0.0002s
  run_kernel1(i)            # ~0.0086s
  run_kernel2(i)            # ~0.00028s
  store_data_from_device(i) # ~0.0002s

I measured time and I got following:

System time is 4:30 hours (measured by linux time command)
Pure opencl event-based timing is 3:30 hours (load+calculate+store)

I'd like to know:

How big is minimal overhead for OpenCL program? In my case it is like 35%
Should I trust event-based timings?
Does enabling profiling add some significant time to whole program execution time?

I know that overhead is depend on program and I know that python isn't as fast as pure C or CPP. But I believe that when I move all my heavy calculations to OpenCL kernels I can loose no more than 5-7%. Please correct me if I wrong.

P.S. AMD OpenCL, AMD GPU

Solution

How do you measure the OCL time? Using only something like:

my_event.profile.end - my_event.profile.start

If it’s the case you can also take another metric like that:

my_event.profile.start - my_event.profile.queued

This metric measure the time spent in the user application as well as in the runtime before execution hence the overhead. This metric is suggested in the AMD programing guide in section 4.4.1.
They also give a warning about profiling explaining that commands can be sent by batch and therefore

Commands submitted as batch report similar start times and the same end time.

If I well recall, NVIDIA streams commands. But in any case you can use that to reduce the overhead. For instance instead of having:

Cl_prog.kernel1(…).wait()
Cl_prog.kernel2(…).wait()

You could do something like:

Event1 =   Cl_prog.kernel1(…)
Event2 = Cl_prog.kernel2(…)
Event1.wait()
Event2.wait()

And so on.
But I digress; now to answer specifically to your questions here are some input taken from the same section I mentioned above (It's from AMD but I guess it should be pretty much the same for NVIDIA):

"For CPU devices, the kernel launch time is fast (tens of µs), but for discrete GPU devices it can be several hundreds µs"
See quote above
"Enabling profiling on a command queue adds approximately 10 μs to 40 μs overhead to all clEnqueue calls".