I have a pyopencl program that performs a long calculation (~3-5 hours per run). I have several kernels launched one after another in a loop, so I have something like this:
    prepare_kernels_and_data()
    for i in range(big_number):     # in my case big_number is 400000
        load_data_to_device(i)      # ~0.0002s
        run_kernel1(i)              # ~0.0086s
        run_kernel2(i)              # ~0.00028s
        store_data_from_device(i)   # ~0.0002s
I measured the run time with the "time" command and got the per-call timings shown in the comments above. Those calls sum to roughly 0.0093s per iteration, which over 400000 iterations is about an hour, yet a whole run takes 3-5 hours. I'd like to know how large the launch overhead typically is and where the remaining time goes.
I know that the overhead depends on the program, and I know that Python isn't as fast as pure C or C++. But I believe that once all my heavy calculations are moved into OpenCL kernels, I should lose no more than 5-7% to overhead. Please correct me if I'm wrong.
P.S. AMD OpenCL, AMD GPU
How do you measure the OCL time? Only with something like this?

    my_event.profile.end - my_event.profile.start

If that's the case, you can also look at another metric:

    my_event.profile.start - my_event.profile.queued

This metric measures the time spent in the user application and in the runtime before execution, hence the overhead. This metric is suggested in the AMD programming guide, section 4.4.1.
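For completeness, here is a minimal, self-contained pyopencl sketch of reading both counters; the kernel and buffer here are placeholders of my own, not your code. Profiling must be enabled on the queue, and the counters are in nanoseconds:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    # event.profile only works if profiling is enabled on the queue
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    prg = cl.Program(ctx, """
    __kernel void double_it(__global float *a) {
        int gid = get_global_id(0);
        a[gid] *= 2.0f;
    }
    """).build()

    a = np.arange(1024, dtype=np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    evt = prg.double_it(queue, a.shape, None, a_buf)
    evt.wait()

    # Both counters are in nanoseconds.
    exec_ns = evt.profile.end - evt.profile.start         # pure execution time
    overhead_ns = evt.profile.start - evt.profile.queued  # pre-execution overhead
    print(exec_ns * 1e-9, overhead_ns * 1e-9)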
They also give a warning about profiling, explaining that commands can be submitted in batches, and therefore:

"Commands submitted as batch report similar start times and the same end time."
If I recall correctly, NVIDIA streams commands. In any case, you can use this to reduce the overhead. For instance, instead of having:
    cl_prog.kernel1(…).wait()
    cl_prog.kernel2(…).wait()
You could do something like:
    event1 = cl_prog.kernel1(…)
    event2 = cl_prog.kernel2(…)
    event1.wait()
    event2.wait()
And so on.
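Applied to your loop, that could look roughly like the sketch below. This assumes your load/run/store helpers just enqueue work on a single in-order queue and return pyopencl events without waiting, which is an assumption about your code, not something shown in the question:

    # Sketch: relies on an in-order queue, so enqueued commands
    # chain automatically without host-side waits in between.
    last_event = None
    for i in range(big_number):
        load_data_to_device(i)                   # enqueue copy, don't wait
        run_kernel1(i)                           # enqueue, don't wait
        run_kernel2(i)                           # enqueue, don't wait
        last_event = store_data_from_device(i)   # enqueue read-back

        # Optionally synchronize every N iterations
        # to bound the amount of in-flight work.
        if i % 1000 == 999:
            last_event.wait()

    if last_event is not None:
        last_event.wait()   # one final wait instead of four per iteration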
But I digress; to answer your questions specifically, here is some input taken from the same section I mentioned above (it's from AMD, but I guess it should be pretty much the same for NVIDIA):
"For CPU devices, the kernel launch time is fast (tens of µs), but for discrete GPU devices it can be several hundreds µs"
See the quote above.