Can you give a pseudo code/ detailed explanation about how the following cache replacement policies are implemented?
It is difficult for me to establish a solid understanding of how they work step by step. Thus, I believe that an algorithm for each policy will be crystal clear.
Logically how they're implemented is that the cache lines in a set are ordered, so they form a queue. The head of the queue is the line that will be evicted next time the cache needs to allocate a line in that set.
New lines are allocated at the end of the queue, farthest away from the "chopping block". If there was an invalid line in the set, no eviction is needed: the elements move to fill the gap in the queue. Otherwise an eviction is needed to make room in this set.
For LRU, on a hit that line moves to the most-recently-used (MRU) position. For FIFO, it doesn't.
For random, there's no queue; the cache just picks a random line to replace if there are no invalid lines in the set.
To minimize cache pollution, a non-temporal load or prefetch could allocate the line in the LRU position (next in line to be evicted) instead of the usual LRU. This is useful for data you don't expect to read more than once before it would be evicted anyway. (I think AMD CPUs actually work like this for NT prefetch. Intel CPUs implement prefetchnta
by restricting it to populating only 1 or 2 ways of any set in L3 cache.)
Physically how they're implemented can vary, as long as they implement the desired algorithm. IDK if it's typical to actually copy the data around (write it all back to the cache array after a read hit, for LRU), or to use extra bits in the tags to record the ordering and only write those back. I guess that would be a tradeoff between a much wider extra write port vs. 3 extra tag bits.
It's common for fast L1d caches to read tags + data in parallel, so the data for all ways in a set is right there to pick from based on the tag comparator result. But for more power-efficient caches that wait for the tag check and then only fetch the data from the way that hits (if there is a hit), copying the data would be impractical.
I think it's very likely that the LRU logic is normally implemented with extra tag bits that give the ordering of the ways in a set. Your mental model can ignore that and just look at lines moving positions, though.
Or instead of me making stuff up, just read Paul Clayton's answer on How is an LRU cache implemented in a CPU?
Apparently some designs use a pseudo-LRU instead of true LRU, to reduce the number of extra bits per set. e.g. a 4-bit number for each of 16 ways in a large last-level cache, or 8x 3-bit numbers in an 8-way cache. (Plus the hardware to do the LRU logic in parallel would be quite significant.) Or instead of just naively numbering each way, you can potentially encode the possible states of the queue in as few as 16 bits per set. But that's still significant.
(I don't know what modern high-performance x86 CPUs use in practice- whether true LRU is worth the cost for small/fast L1d and/or for larger/slower L2 and L3)