How to profile PyCUDA code on Linux?

I have a simple (tested) PyCUDA app and am trying to profile it. I've tried NVIDIA's Compute Visual Profiler, which runs the program 11 times and then emits this error:

NV_Warning: Ignoring the invalid profiler config option: fb0_subp0_read_sectors
Error : Profiler data file '/home/jguy/proj/gpu/tdbp/pyArch/temp_compute_profiler_0_0.csv' does not contain profiler output.This can happen when:
a) Profiling is disabled during the entire run of the application.
b) The application does not invoke any kernel launches or memory transfers.
c) The application does not release resources (contexts, events, etc.). The program needs to be modified to properly free up all resources before termination.

I also tried running "CUDA_PROFILE=1 python scriptname.py arg1". It created a file containing only the header, with no kernel or memory-transfer rows:

NV_Warning: Ignoring the invalid profiler config option: instructions
# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 GeForce GTX 560 Ti
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR fffff7003e38fec8
gpustarttimestamp,method,gputime,cputime,occupancy

In case it's useful, I also have these environment vars set:

CUDA_PROFILE_CONFIG=temp_cuda_profiler.conf
CUDA_PROFILE_CSV=1
CUDA_PROFILE_LOG=profile.csv
CUDA_PROFILE=1

and temp_cuda_profiler.conf contains

gpustarttimestamp
instructions

I've been googling for an hour or so with no luck. Thanks for any insights you can provide!

Solution

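If you use import pycuda.autoinit, call pycuda.autoinit.context.detach() at the end of the program. This fixed the problem. It is consistent with cause (c) in the profiler's error message: the context was apparently never released before the program exited, so the profiler never wrote its output.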
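
For anyone who wants the detach to happen even when the script raises an exception, here is a minimal sketch of one way to wrap it. The helper name detach_on_exit and the functions gpu_work and main are my own, not part of PyCUDA. The only PyCUDA calls used are pycuda.autoinit.context.detach(), gpuarray.to_gpu() and GPUArray.get(). It assumes, as in the fix above, that detaching the autoinit context is enough for the profiler to write its log; I have only seen this reported for the setup in the question, so behavior may differ on other PyCUDA or CUDA versions.

```python
from contextlib import contextmanager


@contextmanager
def detach_on_exit(context):
    """Yield `context`, then call context.detach() even if the body raises.

    `detach_on_exit` is a hypothetical helper, not a PyCUDA API.
    """
    try:
        yield context
    finally:
        # Per the fix above: release the context before the program ends.
        context.detach()


def gpu_work():
    """Do some trivial GPU work and return the result as a NumPy array."""
    import numpy as np
    import pycuda.gpuarray as gpuarray

    a = gpuarray.to_gpu(np.arange(16, dtype=np.float32))
    # The multiplication launches a kernel; .get() copies the result to the host.
    return (2 * a).get()


def main():
    # Imported lazily so the helper above can be imported without a GPU.
    import pycuda.autoinit  # creates a context on the first device

    with detach_on_exit(pycuda.autoinit.context):
        # GPU arrays live only inside gpu_work(), so they are freed
        # before the context is detached.
        print(gpu_work())
```

Calling main() at the bottom of the script, with the CUDA_PROFILE environment variables from the question set, should then leave the context detached by the time the interpreter exits.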