
L1 cache persistence across CUDA kernels


I understand that shared memory on the GPU does not persist across different kernels. However, does the L1 cache persist across different kernel calls?


Solution

  • The SM L1 cache is invalidated between operations (kernel launches, memory copies) on the same stream or on the null stream to guarantee coherence, since each SM's L1 is not kept coherent with the others. But it doesn't really matter, because the L1 cache on GPUs is not really designed to improve temporal locality within a given thread of execution. On a massively parallel processor, it is parallel spatial locality that matters: you want threads that execute near each other to access data that lie near each other.
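
    A minimal sketch of that distinction (the kernel names and the stride parameter are illustrative, not from the original answer): in the first kernel, adjacent threads touch adjacent addresses; in the second, they do not.

        __global__ void coalesced(const float* in, float* out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            // Adjacent threads read adjacent floats, so one warp's
            // load maps onto one (or at most two) cache lines.
            if (i < n)
                out[i] = in[i];
        }

        __global__ void scattered(const float* in, float* out, int n, int stride)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            // Adjacent threads read addresses that are `stride` floats
            // apart, so one warp's load can pull in up to 32 distinct
            // cache lines, each of which goes mostly unused.
            if (i < n)
                out[i] = in[((long long)i * stride) % n];
        }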

    When a cached memory load is performed, it is issued for a single warp, and the cache stores the cache line(s) touched by the threads of that warp (ideally only a single line). If the next warp accesses the same cache line(s), the cache hits and latency is reduced; otherwise, the cache is filled with different lines. If memory accesses are very spread out, later warps will probably evict the earlier warps' lines before they are ever reused.
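
    To put rough numbers on that (assuming 32-thread warps and the common 128-byte cache line; the line size varies by architecture):

        // Fully coalesced: 32 threads x 4-byte float = 128 bytes,
        // i.e. one warp-wide load fills exactly one 128-byte line.
        //
        // Stride of 32 floats (128 bytes) between threads: the same
        // warp-wide load touches 32 different lines, a 4 KB cache
        // footprint for only 128 bytes of useful data.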

    By the time another kernel runs, the data is unlikely to still be in the cache anyway, because the SM will typically have executed many other warps of the previous kernel in the meantime, evicting those lines. So even if the L1 did persist across launches, it would rarely help.
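
    For illustration (the kernel and variable names are hypothetical), even two back-to-back launches on the same stream give the second kernel no guaranteed L1 reuse:

        __global__ void producer(float* data, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                data[i] *= 2.0f;   // may leave lines resident in each SM's L1
        }

        __global__ void consumer(const float* data, float* out, int n)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n)
                out[i] = data[i] + 1.0f;  // cannot assume an L1 hit: the L1 is
                                          // invalidated at the kernel boundary,
                                          // so this load is served from L2 or DRAM
        }

        // Host side, same stream:
        //     producer<<<blocks, threads>>>(d_data, n);
        //     consumer<<<blocks, threads>>>(d_data, d_out, n);

    The L2, by contrast, is shared by all SMs and is the point of coherence, so the second kernel's loads may still hit there.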