cuda nvprof

Local cache hit metric in the CUDA profiler


When profiling some CUDA applications, I see that the local hit rate (the local_hit_rate metric) is 0%.

I want to distinguish between the following two cases, both of which would produce that value:

  1. The application makes no accesses to the local cache.

  2. All accesses to local cache were misses.

How can I tell which case applies? Since the values of inst_compute_ld_st, ldst_issued and ldst_executed are non-zero, is it safe to rule out the first case, or is there something else to consider?

The device is an M2000, which is CC 5.2.


Solution

  • nvprof supports both events (raw counters) and metrics. These can be queried using the following commands:

        nvprof --query-events
        nvprof --query-metrics

    CC5.*/6.* Local Memory Metrics

    • local_load_transactions_per_request: Average number of local memory load transactions performed for each local memory load
    • local_store_transactions_per_request: Average number of local memory store transactions performed for each local memory store
    • local_load_transactions: Number of local memory load transactions
    • local_store_transactions: Number of local memory store transactions
    • local_hit_rate: Hit rate for local loads and stores
    • local_memory_overhead: Ratio of local memory traffic to total memory traffic between the L1 and L2 caches expressed as percentage
    • local_load_throughput: Local memory load throughput
    • local_store_throughput: Local memory store throughput
    • inst_executed_local_loads: Warp level instructions for local loads
    • inst_executed_local_stores: Warp level instructions for local stores
    • l2_local_load_bytes: Bytes read from L2 for misses in Unified Cache for local loads
    • l2_local_global_store_bytes: Bytes written to L2 from Unified Cache for local and global stores. This does not include global atomics.
    • local_load_requests: Total number of local load requests from Multiprocessor
    • local_store_requests: Total number of local store requests from Multiprocessor
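The metrics listed above can be requested together in a single profiling run with nvprof's `--metrics` option. The following is a minimal sketch that only assembles the command line; the helper name `build_nvprof_command`, the chosen subset of metrics, and the `./my_app` binary are illustrative assumptions, not part of nvprof or the original answer.

```python
import shlex

# Metric names taken from the list above. Which ones a given GPU/toolkit
# actually exposes can be checked with `nvprof --query-metrics`.
LOCAL_METRICS = [
    "local_hit_rate",
    "local_load_transactions",
    "local_store_transactions",
    "inst_executed_local_loads",
    "inst_executed_local_stores",
]


def build_nvprof_command(app_args, metrics=LOCAL_METRICS):
    """Return an argv list that runs nvprof and collects the given metrics.

    `app_args` is the application command line as a list, e.g. ["./my_app"].
    This helper is hypothetical; it only builds the command, it does not run it.
    """
    if not app_args:
        raise ValueError("app_args must name the application to profile")
    # nvprof accepts a comma-separated list of metric names after --metrics.
    return ["nvprof", "--metrics", ",".join(metrics), *app_args]


if __name__ == "__main__":
    # "./my_app" is a placeholder for the binary being profiled.
    print(shlex.join(build_nvprof_command(["./my_app"])))
```

The printed command can be pasted into a shell, or the list can be passed to `subprocess.run` on a machine where nvprof is installed.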

    local_*_requests is the number of instructions executed to local memory via the generic address space or the local address space. On CC5.*/6.* I do not recall whether this includes fully predicated-off instructions.

    local_*_transactions is the number of cache accesses that occurred due to the size (32-bit, 64-bit, ...) of the request and the address divergence of the request. If this is non-zero then local memory was accessed.
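Applied to the question, this rule separates the two cases: if the local transaction counts are zero, there were no local accesses at all; if they are non-zero and local_hit_rate is 0%, the accesses happened but missed. Generic load/store counters such as ldst_executed are not specific to local memory, so they likely cannot settle this on their own. Below is a minimal sketch of that decision; the function name and return strings are hypothetical, and the "all missed" reading assumes the reported hit rate is not simply rounded down from a very small value.

```python
def classify_local_hit_rate(hit_rate_pct, load_transactions, store_transactions):
    """Interpret local_hit_rate together with the local transaction counts.

    Arguments are values read from the profiler output:
      hit_rate_pct       -- local_hit_rate, in percent
      load_transactions  -- local_load_transactions
      store_transactions -- local_store_transactions
    """
    total = load_transactions + store_transactions
    if total == 0:
        # No cache accesses were generated for local memory, so the
        # hit rate carries no information.
        return "no local memory accesses"
    if hit_rate_pct == 0:
        # Assumption: an exact 0% with non-zero transactions means every
        # access missed (rather than a rounded-down small hit rate).
        return "local memory accessed, all accesses missed"
    return "local memory accessed, some accesses hit"


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(classify_local_hit_rate(0.0, 0, 0))
    print(classify_local_hit_rate(0.0, 1024, 512))
```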

    l2_local_*_bytes is the number of bytes of data loaded from or stored to the L2 cache.