I have a question from the book "Professional CUDA C Programming"
It says this about the GPU cache:

> On the CPU, both memory loads and stores can be cached. However, on the GPU only memory load operations can be cached; memory store operations cannot be cached. [p142]
But on another page, it says:

> Global memory loads/stores are staged through caches. [p158]
I'm really confused about whether the GPU caches stores or not.

If the first quote is correct, I understand it to mean that the GPU does not cache writes (modifications of data), so a write goes directly to global memory (DRAM).

Also, is this similar to the "no-write-allocate" policy on CPUs?

I'd appreciate a clear explanation... Thanks!
Even the ancient Fermi architecture (compute capability 2.x) cached stores in L2 according to its whitepaper (emphasis mine):

> Fermi features a 768 KB unified L2 cache that services all load, store, and texture requests.
So the book seems to be talking about write-caching in L1 data cache specifically.
The short answer regarding write-caching in L1 is that since the Volta architecture (compute capability 7.0, newer than the architectures covered by the book OP quotes), stores can certainly be cached in L1 according to its whitepaper:

> **Enhanced L1 Data Cache and Shared Memory**
>
> [...] Prior NVIDIA GPUs only performed load caching, while GV100 introduces write-caching (caching of store operations) to further improve performance.
and the Turing Tuning Guide (compute capabilities 7.2 and 7.5):

> Like Volta, Turing’s L1 can cache write operations (write-through).
For context: pre-Volta architectures did not even consistently cache global *loads* in the L1 data cache. Some GPUs always did it, some needed special compilation flags (e.g. `-Xptxas -dlcm=ca`), and some could not do it at all (although one could always use the smaller on-chip constant cache for read-caching instead).
As all architectures since Volta/Turing feature the same on-chip unified data cache architecture, and as their tuning guides say nothing further about write-caching in L1, one can safely assume that the newer architectures (Ampere, Ada, Hopper and Blackwell) also cache global-memory writes in L1.
For a deeper dive, take a look at the PTX ISA's cache operators (also exposed as CUDA C++ intrinsics, the store functions using cache hints):
| Operator | Meaning |
| --- | --- |
| `.wb` | Cache write-back all coherent levels. The default store instruction cache operation is `st.wb`, which writes back cache lines of coherent cache levels with normal eviction policy. [...] |
| `.cg` | Cache at global level (cache in L2 and below, not L1). Use `st.cg` to cache global store data only globally, bypassing the L1 cache, and cache only in the L2 cache. |
| `.cs` | Cache streaming, likely to be accessed once. The `st.cs` store cached-streaming operation allocates cache lines with evict-first policy to limit cache pollution by streaming output data. |
| `.wt` | Cache write-through (to system memory). The `st.wt` store write-through operation applied to a global System Memory address writes through the L2 cache. |
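In CUDA C++ these cache operators are available without writing inline PTX, via the store functions using cache hints (`__stwb`, `__stcg`, `__stcs`, `__stwt`, available on compute capability 3.2 and higher). A minimal sketch; the kernel and parameter names are made up for illustration, and in real code you would of course pick just one hint per store:

```cuda
// Sketch: the four store cache hints applied to a global-memory write.
// __stwb/__stcg/__stcs/__stwt are CUDA C++ store functions using cache
// hints; they map to st.wb / st.cg / st.cs / st.wt in PTX.
__global__ void store_hints(float* out, float val)
{
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;

    __stwb(&out[i], val); // st.wb: default, write-back all coherent levels
    __stcg(&out[i], val); // st.cg: cache in L2 only, bypass L1
    __stcs(&out[i], val); // st.cs: streaming, evict-first policy
    __stwt(&out[i], val); // st.wt: write-through the L2 cache
}
```

A plain `out[i] = val;` compiles to the default `st.wb`, so the hints only matter when you want to deviate from that behavior.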
This table is written confusingly (maybe to avoid describing architecture-specific behavior), but given what we already know about L1 write-caching, the best interpretation I can come up with is that `.wb` and `.wt` determine how the write is handled by L2, while leaving L1 write-caching up to the particular architecture: L1 is not a coherent level and probably does not contain the logic needed to implement write-back. As the description of `.wb` says nothing about the handling in non-coherent levels, this is consistent.
One can think of L1 write-caching as always write-through (i.e. eagerly writing to the next level), but with invalidation of the L1 cache line on pre-Volta architectures, which is not what one typically means by "write-through" but should still be fine.
`.cg` explicitly disallows caching in L1, i.e. it should always reproduce the behavior of pre-Volta architectures. And `.cs` does not mention the cache levels at all and only determines the eviction policy. This interpretation agrees with the one given at Making better sense of the PTX store caching modes (assuming a Volta or later architecture).
So stores are always cached in L2, while L1 write-caching depends on both the GPU architecture and the actual store instruction used.