I am trying to optimize a pyOpenCL program, so I was wondering whether there is a way to profile it and see where most of the time is spent.
Do you have any idea how to approach this problem?
Thanks in advance
Andi
EDIT: For example, NVIDIA's nvprof for CUDA would do the trick for pyCuda, but not for pyOpenCL.
OK, I have figured out a way: the CUDA Toolkit 3.1 offers profiling for OpenCL (later versions do not). From this package, use the Compute Visual Profiler (computeprof.exe). It is available for Windows and Linux here and can be installed alongside a newer CUDA Toolkit.
It looks like this: [screenshot of the Compute Visual Profiler]
I hope this helps someone else too.
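If installing the old toolkit is not an option, a rough alternative that is not part of the approach above is to time individual commands from inside the program using OpenCL event profiling. The sketch below assumes pyopencl and numpy are installed and that a device is available. The kernel `twice` and the helpers `event_ms`, `summarize` and `demo` are made-up names for illustration only. It creates the command queue with `PROFILING_ENABLE` and reads the `profile.start` and `profile.end` timestamps (in nanoseconds) of each finished event.

```python
from collections import defaultdict


def event_ms(event):
    """Device execution time of one finished OpenCL event, in milliseconds."""
    # profile.start / profile.end are device timestamps in nanoseconds.
    return (event.profile.end - event.profile.start) * 1e-6


def summarize(timings):
    """Sum (label, milliseconds) pairs per label, slowest label first."""
    totals = defaultdict(float)
    for label, ms in timings:
        totals[label] += ms
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def demo():
    """Hypothetical example: time one kernel and one device-to-host copy."""
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()  # may ask which device to use
    # Profiling must be enabled on the queue, otherwise event.profile fails.
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE
    )

    a = np.random.rand(1 << 20).astype(np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)

    prg = cl.Program(ctx, """
    __kernel void twice(__global float *a) {
        int i = get_global_id(0);
        a[i] = 2.0f * a[i];
    }
    """).build()

    timings = []

    evt = prg.twice(queue, a.shape, None, a_buf)
    evt.wait()  # timestamps are only valid once the command has finished
    timings.append(("twice kernel", event_ms(evt)))

    out = np.empty_like(a)
    evt = cl.enqueue_copy(queue, out, a_buf)
    evt.wait()
    timings.append(("device-to-host copy", event_ms(evt)))

    for label, ms in summarize(timings):
        print(f"{label}: {ms:.3f} ms")
```

Calling `demo()` prints one line per label. In a real program you would collect the events of your own kernels and transfers in the same way. Note that this only measures time spent on the device, so Python-side overhead still needs a host profiler such as cProfile.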