Tags: caching, cuda, gpgpu, nsight, compute-capability

Cache behaviour in Compute Capability 7.5


These are my assumptions:

  1. There are two types of loads, cached and uncached. In the first one, the traffic goes through L1 and L2, while in the second one, the traffic goes only through L2.
  2. The default behaviour in Compute Capability 6.x and 7.x is cached accesses.
  3. An L1 cache line is 128 bytes and an L2 cache line is 32 bytes, so for every L1 transaction generated, there should be four L2 transactions (one per sector).
  4. In Nsight, an SM->TEX Request means a warp-level instruction merged from 32 threads. L2->TEX Returns and TEX->SM Returns are measures of how many sectors are transferred between each memory unit.

Assuming Compute Capability 7.5, these are my questions:

  1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?
  2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.
  3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?

Solution

  • CC 6.x/7.x

    • L1 cache line size is 128 bytes, divided into four 32-byte sectors. On a miss, only the addressed sectors are fetched from L2.
    • L2 cache line size is 128 bytes, divided into four 32-byte sectors.
      • On CC 7.0 (HBM2), 64B promotion is enabled: if there is a miss to the lower 64 bytes of the cache line, the lower 64 bytes are fetched from DRAM; if there is a miss to the upper 64 bytes, the upper 64 bytes are fetched.
      • On CC 6.x/7.5, only the accessed 32-byte sectors are fetched from DRAM.
    • In terms of L1 cache policy
      • CC 6.0 has load caching enabled by default
      • CC 6.1/6.2 has load caching disabled by default - see programming guide
      • CC 7.x has load caching enabled by default - see PTX for details on cache control
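As a sketch of what per-load cache control looks like (the kernel name and shapes below are invented for illustration; `__ldg()` and `__ldcg()` are documented CUDA "load functions using cache hints"):

```cuda
// Illustration only: choosing the cache path per load on CC 7.x.
__global__ void load_paths(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float a = __ldg(&in[i]);   // read-only path (ld.global.nc), cached in L1TEX
    float b = __ldcg(&in[i]);  // .cg hint (ld.global.cg): cache in L2 only, bypass L1
    out[i] = a + b;
}
```

Alternatively, `nvcc -Xptxas -dlcm=cg` (or `-dlcm=ca`) changes the default load cache modifier for the whole compilation unit rather than per load.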

    In Nsight Compute the meaning of the term "requests" differs between 6.x and 7.x.

    • For 5.x-6.x the number of requests per instruction varies with the type of operation and the width of the data. For example, a 32-bit load is 8 threads/request, a 64-bit load is 4 threads/request, and a 128-bit load is 2 threads/request.
    • For 7.x, requests should be equivalent to instructions unless the access pattern has address divergence that causes serialization.

    Answering your CC 7.5 Questions

    1. The third assumption seems to imply that L2->TEX Returns should always be a multiple of four for global cached loads, but that's not always the case. What is happening here?

    The L1TEX unit will only fetch the missed 32B sectors in a cache line.

    2. Is there still a point in marking pointers with const and __restrict__ qualifiers? That used to be a hint to the compiler that the data is read-only and therefore can be cached in L1/texture cache, but now all data is cached there, both read-only and not read-only.

    The compiler can perform additional optimizations if the data is known to be read-only.
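A sketch of what the qualifiers still buy you (the kernel below is illustrative, not from the question): with both const and __restrict__ the compiler can prove the loads are read-only and do not alias the output, so it is free to use the read-only load path and to hoist or reorder loads across the store.

```cuda
// Illustration only: const + __restrict__ tell the compiler that x and y
// are read-only and do not alias out, so it may emit non-coherent loads
// (ld.global.nc) and schedule them more aggressively.
__global__ void axpy(float a,
                     const float* __restrict__ x,
                     const float* __restrict__ y,
                     float*       __restrict__ out,
                     int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * x[i] + y[i];
}
```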

    3. From my fourth assumption, I would think that whenever TEX->SM Returns is greater than L2->TEX Returns, the difference comes from cache hits. That's because when there's a cache hit, you get some sectors read from L1, but none from L2. Is this true?

    L1TEX to SM return bandwidth is 128B/cycle. L2 to L1TEX returns are counted in 32B sectors, so the two counters are in different units and cannot be compared directly.

    The Nsight Compute Memory Workload Analysis | L1/TEX Cache table shows:

    • Sector Misses to L2 (in 32B sectors)
    • Returns to SM (in cycles; each return cycle carries between 1 and 128 bytes)