I have been trying to benchmark my FFT program on a GPU using PyOpenCL, and I get completely different results when using OpenCL event profiling versus Python's time module. To use profiling, I do something like this:
queue = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
<other code>
for i in range(N):
    events.append(prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>))
    events[i].wait()
for i in range(N):
    elapsed = elapsed + 1e-9*(events[i].profile.end - events[i].profile.start)
print(elapsed)
The time module, on the other hand, can be used like this:
k = time.time()
for i in range(N):
    event = prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>)
print(time.time()-k)
Since both of these give totally different results for N=20 (while the answer remains the same and correct!), I have the following question: what is the right way to benchmark OpenCL programs in Python?
Your second case only captures the time taken to enqueue the kernels, not to actually run them. These kernel enqueue calls return as soon as the kernel invocation has been placed in the queue; the kernel runs asynchronously with respect to your host code. To time the kernel execution as well, just add a call that waits until all enqueued commands have finished:
k = time.time()
for i in range(N):
    event = prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>)
queue.finish()
print(time.time()-k)
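Equivalently, if you keep the events returned by each enqueue call, you can block on those events directly instead of draining the whole queue. A minimal variation using PyOpenCL's wait_for_events (with the same <buffers> placeholder as above):

k = time.time()
events = []
for i in range(N):
    events.append(prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>))
# block until every enqueued kernel has completed
cl.wait_for_events(events)
print(time.time()-k)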
Your first case correctly measures the time spent in kernel execution, but unnecessarily blocks the host between each kernel invocation. You could just use queue.finish() again once all commands have been enqueued:
for i in range(N):
    events.append(prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>))
queue.finish()
elapsed = 0
for i in range(N):
    elapsed = elapsed + 1e-9*(events[i].profile.end - events[i].profile.start)
print(elapsed)
Both of these approaches should return almost identical times.
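For reference, here is a self-contained sketch that puts the two measurements side by side. Since the actual butterfly kernel and its buffers aren't shown in the question, a trivial placeholder kernel called scale stands in for it:

import time
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

# placeholder kernel standing in for prg3.butterfly
prg = cl.Program(ctx, """
__kernel void scale(__global float *a) {
    int gid = get_global_id(0);
    a[gid] *= 2.0f;
}
""").build()

a = np.random.rand(1 << 20).astype(np.float32)
mf = cl.mem_flags
a_dev = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

N = 20

# wall-clock time: enqueue everything, then block once with queue.finish()
k = time.time()
events = []
for i in range(N):
    events.append(prg.scale(queue, (a.size,), None, a_dev))
queue.finish()
wall = time.time() - k

# device time: sum the per-kernel execution times from the event profiles
elapsed = sum(1e-9 * (e.profile.end - e.profile.start) for e in events)

print("wall-clock: %.6f s, profiled kernel time: %.6f s" % (wall, elapsed))

The two numbers should agree closely; any remaining gap is the host-side enqueue overhead, which queue.finish() folds into the wall-clock measurement.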