I have been trying to benchmark my FFT program on a GPU using PyOpenCL, and I get completely different results when using OpenCL event profiling versus Python's time module. To use profiling, I do something like this:
queue = cl.CommandQueue(ctx,properties=cl.command_queue_properties.PROFILING_ENABLE)
<other code>
for i in range(N):
    events.append(prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>))
    events[i].wait()
for i in range(N):
    elapsed = elapsed + 1e-9*(events[i].profile.end - events[i].profile.start)
print(elapsed)
The time module, on the other hand, can be used like this:
k = time.time()
for i in range(N):
    event = prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>)
print(time.time()-k)
Since both of these give totally different results for N=20 (while the answer remains the same and correct!), I have the following question: what is the right way to benchmark OpenCL programs in Python?
Your second case only captures the time taken to enqueue the kernels, not to actually run them. These kernel enqueue calls return as soon as the kernel invocation has been placed in the queue; the kernel runs asynchronously with respect to your host code. To time the kernel execution as well, just add a call that waits until all enqueued commands have finished:
k = time.time()
for i in range(N):
    event = prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>)
queue.finish()
print(time.time()-k)
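Equivalently, if you keep the events returned by each enqueue call, you can block on those events directly instead of draining the whole queue. A minimal variation using PyOpenCL's wait_for_events (with the same <buffers> placeholder as above):

k = time.time()
events = []
for i in range(N):
    events.append(prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>))
# block until every enqueued kernel has completed
cl.wait_for_events(events)
print(time.time()-k)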
Your first case correctly measures the time spent in kernel execution, but unnecessarily blocks the host between each kernel invocation. You could just use queue.finish() again once all commands have been enqueued:
for i in range(N):
    events.append(prg3.butterfly(queue,(len(twid),),None,twid_dev,<buffers>))
queue.finish()
elapsed = 0
for i in range(N):
    elapsed = elapsed + 1e-9*(events[i].profile.end - events[i].profile.start)
print(elapsed)
Both of these approaches should return almost identical times.
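For reference, here is a self-contained sketch that puts the two measurements side by side. Since the actual butterfly kernel and its buffers aren't shown in the question, a trivial placeholder kernel called scale stands in for it:

import time
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

# placeholder kernel standing in for prg3.butterfly
prg = cl.Program(ctx, """
__kernel void scale(__global float *a) {
    int gid = get_global_id(0);
    a[gid] *= 2.0f;
}
""").build()

a = np.random.rand(1 << 20).astype(np.float32)
mf = cl.mem_flags
a_dev = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

N = 20

# wall-clock time: enqueue everything, then block once with queue.finish()
k = time.time()
events = []
for i in range(N):
    events.append(prg.scale(queue, (a.size,), None, a_dev))
queue.finish()
wall = time.time() - k

# device time: sum the per-kernel execution times from the event profiles
elapsed = sum(1e-9 * (e.profile.end - e.profile.start) for e in events)

print("wall-clock: %.6f s, profiled kernel time: %.6f s" % (wall, elapsed))

The two numbers should agree closely; any remaining gap is the host-side enqueue overhead, which queue.finish() folds into the wall-clock measurement.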