Is there a way to profile an OpenCL or a pyOpenCL program?

I am trying to optimize a pyOpenCL program, so I was wondering whether there is a way to profile it and see where most of the time is spent.

Do you have any idea how to approach this problem?

Thanks in advance
Andi

EDIT: For example, NVIDIA's nvprof for CUDA would do the trick for PyCUDA, but not for pyOpenCL.

Solution

Ok,
I have figured out a way: the CUDA Toolkit 3.1 offers profiling for OpenCL (later versions do not). From this package, use the Compute Visual Profiler (computeprof.exe). It is available for Windows and Linux here and can be installed alongside a newer CUDA Toolkit.
It looks like this:

[Screenshots: Timings, Total time histogram, Hist 2, Hist 3]

I hope this helps someone else too.
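
As a side note, if installing an old toolkit is not an option, OpenCL's built-in event profiling can also be reached from pyOpenCL: create the command queue with PROFILING_ENABLE and read the start/end timestamps (in nanoseconds) from the event each kernel launch returns. Below is only a rough sketch of that idea, not a replacement for the visual profiler. The names `summarize`, `profile_demo` and the kernel `twice` are made up for illustration, and it assumes numpy, pyopencl and a working OpenCL device are available.

```python
from collections import defaultdict


def summarize(records):
    """Sum device time per kernel name.

    records: iterable of (kernel_name, start_ns, end_ns) tuples, e.g. taken
    from event.profile.start / event.profile.end.
    Returns a list of (kernel_name, total_ms), slowest first.
    """
    totals = defaultdict(float)
    for name, start_ns, end_ns in records:
        totals[name] += (end_ns - start_ns) / 1e6  # ns -> ms
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def profile_demo():
    """Time one toy kernel with OpenCL event profiling (hypothetical example)."""
    # Imported here so that summarize() is usable without an OpenCL install.
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    # Profiling must be enabled on the queue, otherwise event.profile fails.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE
    )

    prg = cl.Program(ctx, """
    __kernel void twice(__global float *a) {
        int i = get_global_id(0);
        a[i] *= 2.0f;
    }
    """).build()

    a = np.arange(1 << 20, dtype=np.float32)
    mf = cl.mem_flags
    buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    evt = prg.twice(queue, a.shape, None, buf)
    evt.wait()  # timestamps are only valid once the kernel has finished

    return summarize([("twice", evt.profile.start, evt.profile.end)])
```

Calling `print(profile_demo())` on a machine with an OpenCL device should print the device time of the toy kernel in milliseconds. In a real program you would collect one record per kernel launch and pass the whole list to `summarize` to see which kernel dominates.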