
calculating gst_throughput and gld_throughput with nvprof


I have the following problem: I want to measure gst_efficiency and gld_efficiency for my CUDA application using nvprof. The documentation distributed with CUDA 5.0 tells me to compute these using the following formulas for devices with compute capability 2.0-3.0:

gld_efficiency = 100 * gld_requested_throughput / gld_throughput

gst_efficiency = 100 * gst_requested_throughput / gst_throughput

For the required metrics the following formulas are given:

gld_throughput = ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime

gst_throughput = ((l2_subp0_write_requests + l2_subp1_write_requests) * 32 - (l1_local_ld_miss * 128)) / gputime

gld_requested_throughput = (gld_inst_8bit + 2 * gld_inst_16bit + 4 * gld_inst_32bit + 8 * gld_inst_64bit + 16 * gld_inst_128bit) / gputime

gst_requested_throughput = (gst_inst_8bit + 2 * gst_inst_16bit + 4 * gst_inst_32bit + 8 * gst_inst_64bit + 16 * gst_inst_128bit) / gputime
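To show what I'm ultimately after: per kernel and per run I just need the two ratios. Here is a minimal sketch of that arithmetic in plain host C++; every counter value in it is made up and only stands in for whatever nvprof would report for a single kernel launch:

    // Minimal sketch (host C++): how the four throughput formulas above combine
    // into the two efficiency metrics. All counter values are made up; in reality
    // they would come from nvprof's event output for a single kernel launch.
    #include <cstdio>

    int main() {
        double gputime = 1.0;  // kernel time; it cancels out of the efficiency ratios

        // made-up event counts (only 32-bit loads/stores, to keep it short)
        double gld_inst_32bit = 1000000, gst_inst_32bit = 500000;
        double global_load_hit = 31250;
        double l2_subp0_read_requests = 62500, l2_subp1_read_requests = 62500;
        double l2_subp0_write_requests = 31250, l2_subp1_write_requests = 31250;
        double l1_local_ld_miss = 0;

        double gld_requested = (4 * gld_inst_32bit) / gputime;
        double gst_requested = (4 * gst_inst_32bit) / gputime;
        double gld = ((128 * global_load_hit)
                      + (l2_subp0_read_requests + l2_subp1_read_requests) * 32
                      - (l1_local_ld_miss * 128)) / gputime;
        double gst = ((l2_subp0_write_requests + l2_subp1_write_requests) * 32
                      - (l1_local_ld_miss * 128)) / gputime;

        printf("gld_efficiency = %.1f %%\n", 100.0 * gld_requested / gld);
        printf("gst_efficiency = %.1f %%\n", 100.0 * gst_requested / gst);
        return 0;
    }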

Since the documentation gives no formulas for the counters on the right-hand sides, I assume they are events which can be counted by nvprof. But some of these events do not seem to be available on my GTX 460 (I also tried a GTX 560 Ti). I pasted the output of nvprof --query-events.

Any ideas what's going wrong or what I'm misinterpreting?

EDIT: I don't want to use the CUDA Visual Profiler, since I'm trying to analyse my application for different parameters. I therefore want to run nvprof with multiple parameter configurations, recording multiple events (each one in its own run), and then output the data in tables. I already have this automated and working for other metrics (e.g. instructions issued) and want to do the same for load and store efficiency. This is why I'm not interested in solutions involving nvvp. By the way, for my application nvvp fails to calculate the metrics required for store efficiency, so it doesn't help me at all in this case.


Solution

  • I'm glad somebody had the same issue :) I was trying to do the very same thing and couldn't use the Visual Profiler, because I wanted to profile something like 6000 different kernels.

    The formulas on the NVIDIA site are poorly documented; the variables in them can actually be:

    a) events

    b) other Metrics

    c) other variables that depend on the GPU you have

    However, a LOT of the metrics there either have typos in them or are worded a bit differently in nvprof than they are on the site. Also, the variables are not tagged, so you can't tell just by looking whether they are a), b) or c). I used a script to grep them and then had to fix the results by hand. Here is what I found (the renames are also collected into a small lookup table after the list):

    1) "l1_local/global_ld/st_hit/miss" These have "load"/"store" in nvprof instead of "ld"/"st" on site.

    2) "l2_ ...whatever... _requests" These have "sector_queries" in nvprof instead of "requests".

    3) "local_load/store_hit/miss" These have "l1_" in additionally in the profiler - "l1_local/global_load/store_hit/miss"

    4) "tex0_cache_misses" This one has "sector" in it in the profiler - "tex0_cache_sector_misses"

    5) "tex_cache_sector_queries" Missing "0" - so "tex0_cache_sector_queries" in the nvprof.

    Finally, the variables:

    1) "#SM" The number of streaming multiprocessors. Get via cudaDeviceProp.

    2) "gputime" Obviously, the execution time on GPU.

    3) "warp_size" The size of warp on your GPU, again get via cudaDeviceProp.

    4) "max_warps_per_sm" Number of blocks executable on an sm * #SM * warps per block. I guess.

    5) "elapsed_cycles" Found this: https://devtalk.nvidia.com/default/topic/518827/computeprof-34-active-cycles-34-counter-34-active-cycles-34-value-doesn-39-t-make-sense-to-/ But still not entirely sure, if I get it.

    Hopefully this helps you and some other people who encounter the same problem :)