My book makes the following statement:
Implementing LRU in a fully associative TLB is very expensive, so the usual approach is random replacement.
I don't understand why it's expensive in a fully associative cache. Isn't it just a matter of adding an extra reference bit...?
LRU requires maintaining a total order relation between all valid cache lines in a cache set. For example, consider a 3-way cache set holding lines A, B, and C, ordered from the most recently accessed to the least recently accessed (represented as ABC). If C is accessed next, then the order becomes CAB. If a new line, D, needs to be filled in the same cache set, since there are no invalid lines, the LRU replacement policy will choose B, which is now the least recently accessed line, to be evicted and replaced by the new line. Then the order becomes DCA.
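To make the ABC example concrete, here is a minimal software sketch of one cache set that keeps its lines in recency order. The class name `LRUSet` and its `access` method are made up for illustration; real hardware does not keep a list like this, which is exactly why the bookkeeping is costly.

```python
class LRUSet:
    """One cache set, kept as a list ordered from most to least recently used."""

    def __init__(self, ways):
        self.ways = ways
        self.order = []  # index 0 = most recently used, last = least recently used

    def access(self, line):
        """Access `line`; return the evicted line, or None if nothing was evicted."""
        evicted = None
        if line in self.order:
            self.order.remove(line)      # hit: pull the line out of its old position
        elif len(self.order) == self.ways:
            evicted = self.order.pop()   # miss on a full set: evict the LRU line
        self.order.insert(0, line)       # the accessed line becomes the MRU line
        return evicted


s = LRUSet(ways=3)
for line in "CBA":           # access C, then B, then A, giving the order ABC
    s.access(line)
print("".join(s.order))      # ABC
s.access("C")
print("".join(s.order))      # CAB
print(s.access("D"))         # B
print("".join(s.order))      # DCA
```

Every access can reorder the whole set, so the state to be tracked is a full permutation of the lines, not a single flag per line.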
For a 3-way cache, there are up to 3*2 = 6 possible orders for the lines in each set. In general, for an N-way cache, there are up to N! (N factorial) possible orders. Theoretically, you need at least log2(N!) bits (rounded up to the nearest integer) per cache set to maintain the LRU property accurately. Note that log2(N!) is Θ(N log N), so it grows superlinearly with respect to the number of ways. No normal person likes anything whose cost grows superlinearly.
A particularly cheap case is a 2-way cache, where the LRU state requires only log2(2!) = 1 bit. It is much more expensive for any other number of ways though.
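As a quick check of the log2(N!) bound, the following sketch computes the theoretical minimum number of LRU state bits per set for a few associativities. The function name `min_lru_bits` is hypothetical; it uses integer arithmetic to avoid floating-point rounding.

```python
import math


def min_lru_bits(ways):
    """Smallest number of bits that can encode one of ways! orderings,
    i.e. ceil(log2(ways!)), computed with integers only."""
    return (math.factorial(ways) - 1).bit_length()


for n in (2, 3, 4, 8):
    print(n, min_lru_bits(n))
# 2 1
# 3 3
# 4 5
# 8 16
```

Calling `min_lru_bits` with the entry count of a fully associative TLB (which is the number of ways of its single set) shows how quickly the lower bound alone grows.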
In practice, though, there is no easy way to maintain a single number that represents the LRU state of a set. If the current LRU state is X and then some access to a line occurs, how can the next LRU state be determined? There is no simple mathematical relation that can be implemented in hardware. So instead of using a single number, a realistic implementation would use multiple numbers, one per cache line. In this case, these numbers are called ages. Such a design requires (many) more bits than the theoretical minimum of log2(N!) to maintain the LRU state.
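One possible age-based scheme can be sketched as follows. This is an assumption about how ages might be kept, not a description of any particular processor: each way holds a counter of ceil(log2(N)) bits, age 0 means most recently used, and age N-1 means least recently used. The names `AgeLRUSet` and `age_bits` are made up for the example.

```python
class AgeLRUSet:
    """One cache set with a per-way age counter (0 = newest, ways-1 = oldest)."""

    def __init__(self, ways):
        self.tags = [None] * ways       # None marks an invalid (empty) way
        self.ages = list(range(ways))   # ages always form a permutation of 0..ways-1

    def _touch(self, way):
        old = self.ages[way]
        for w in range(len(self.ages)):
            if self.ages[w] < old:      # every line newer than the touched one ages by 1
                self.ages[w] += 1
        self.ages[way] = 0              # the touched line becomes the newest

    def access(self, tag):
        """Access `tag`; return the evicted tag, or None if nothing valid was evicted."""
        evicted = None
        if tag in self.tags:
            way = self.tags.index(tag)
        else:
            # The oldest way is the victim; invalid ways are never touched,
            # so they stay older than every valid way and get filled first.
            way = self.ages.index(len(self.ages) - 1)
            evicted = self.tags[way]
            self.tags[way] = tag
        self._touch(way)
        return evicted


def age_bits(ways):
    """Total age bits per set: ways counters of ceil(log2(ways)) bits each."""
    return ways * (ways - 1).bit_length()


s = AgeLRUSet(ways=3)
for t in "CBA":
    s.access(t)
s.access("C")
print(s.access("D"))                   # B
print(age_bits(4), age_bits(8))        # 8 24
```

Under these assumptions a 4-way set needs 8 age bits against a minimum of 5, and an 8-way set needs 24 against 16. Note also that every access may have to compare and increment all N counters in parallel, which adds logic on top of the storage.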
Aside from the hardware overhead, the LRU replacement policy is not necessarily optimal for performance. It depends on the memory access patterns of the applications in the target market domain and the rest of the cache hierarchy.
LRU has been used in many real processors. Caches that are 2-way associative typically use LRU. For example, AMD SledgeHammer uses LRU for both L1I and L1D caches. The Itanium 2 processor's L1 instruction cache uses LRU and it is 4-way associative. Usually, when the number of ways is larger than two, caches don't use LRU.