I am learning how the L1 cache of AMD processors is accessed, but even after reading AMD's manual repeatedly, I still can't understand it.
My understanding of Intel's L1 data cache is:
The L1 cache is virtually indexed and physically tagged (VIPT). Therefore, the index bits of the virtual address are used to find the corresponding cache set, and the physical tag then determines which cache line in the set (if any) matches.
(Intel makes their L1d caches associative enough and small enough that the index bits come only from the offset-within-page which is the same in the physical address. So they get the speed of VIPT with none of the aliasing problems, behaving like PIPT.)
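To check this arithmetic for a 32 KiB, 8-way, 64-byte-line cache, here is a toy C illustration of the bit ranges (my own example, nothing vendor-specific):

    #include <stdio.h>

    int main(void) {
        /* 32 KiB / 8 ways = 4 KiB per way; 4 KiB / 64-byte lines = 64 sets,
           so the set index is bits 11:6 of the address, entirely inside the
           12-bit page offset (bits 11:0), which is identical in the
           physical address. */
        unsigned long long va = 0x7fff12345678ULL;
        unsigned set      = (va >> 6) & 0x3f;  /* bits 11:6 */
        unsigned page_off = va & 0xfff;        /* bits 11:0 */
        printf("set=%u, from page offset alone=%u\n",
               set, (page_off >> 6) & 0x3f);   /* both print 25 */
        return 0;
    }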
But AMD uses a different method. In Zen 1, they have a 32-Kbyte, 8-way set associative L1d cache, which (unlike the 64-Kbyte, 4-way L1i) is small enough to avoid aliasing problems without micro-tags.
From AMD's 2017 Software Optimization Manual, section 2.6.2.2 "Microarchitecture of AMD Family 17h Processor" (Zen 1):
The L1 data cache tags contain a linear-address-based microtag (utag) that tags each cacheline with the linear address that was used to access the cacheline initially. Loads use this utag to determine which way of the cache to read using their linear address, which is available before the load's physical address has been determined via the TLB. The utag is a hash of the load's linear address. This linear address based lookup enables a very accurate prediction of in which way the cacheline is located prior to a read of the cache data. This allows a load to read just a single cache way, instead of all 8. This saves power and reduces bank conflicts.
It is possible for the utag to be wrong in both directions: it can predict hit when the access will miss, and it can predict miss when the access could have hit. In either case, a fill request to the L2 cache is initiated and the utag is updated when L2 responds to the fill request.
Linear aliasing occurs when two different linear addresses are mapped to the same physical address. This can cause performance penalties for loads and stores to the aliased cachelines. A load to an address that is valid in the L1 DC but under a different linear alias will see an L1 DC miss, which requires an L2 cache request to be made. The latency will generally be no larger than that of an L2 cache hit. However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.
It is also possible for two different linear addresses that are NOT aliased to the same physical address to conflict in the utag, if they have the same linear hash. At a given L1 DC index (11:6), only one cacheline with a given linear hash is accessible at any time; any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.
What is the specific scenario behind that second paragraph? Under what circumstances will a hit be predicted as a miss, and a miss as a hit? When the CPU fetches data from memory into the cache, does it compute a cache way based on the utag and simply put the line there, even if other cache ways in the set are empty?
How can different linear addresses map to the same physical address?
What does this sentence mean? My understanding is that the utag is first calculated from the linear (virtual) address to determine which cache way to use, and then the tag field of the physical address determines whether it is a cache hit. How is the utag updated? Is it recorded in the cache?
How does AMD decide whether an access is a cache hit or a miss? Why are some hits treated as misses? Can someone explain? Many thanks!
The L1 data cache tags contain a linear-address-based microtag (utag) that tags each cacheline with the linear address that was used to access the cacheline initially.
Each cache line in the L1D has a utag associated with it. This implies the utag memory structure is organized exactly like the L1D (i.e., 8 ways and 64 sets) and there is a one-to-one correspondence between the entries. The utag is calculated based on the linear address of the request that caused the line to be filled in the L1D.
Loads use this utag to determine which way of the cache to read using their linear address, which is available before the load's physical address has been determined via the TLB.
The linear address of a load is sent simultaneously to the way predictor and the TLB (it's better to use the term MMU, since there are multiple TLBs). A particular set in the utag memory is selected using certain bits of the linear address (11:6) and all of the 8 utags in that set are read at the same time. Meanwhile, a utag is calculated based on the linear address of the load request. When both of these operations complete, the given utag is compared against all the utags stored in the set. The utag memory is maintained such that there can be at most one utag in each set with the same value. In case of a hit in the utag memory, the way predictor predicts that the target cache line is in the corresponding cache entry in the L1D. Up until this point, the physical address is not yet needed.
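To make that flow concrete, here is a minimal C model of the lookup, assuming the 64-set x 8-way organization described above. All names are mine, and the hash is a stand-in (the real one is discussed next):

    #include <stdint.h>

    #define SETS 64
    #define WAYS 8

    struct l1d_entry {
        uint8_t  utag;        /* 8-bit hash of the filling linear address */
        uint8_t  utag_valid;  /* whether this way is reachable via its utag */
        uint64_t ptag;        /* physical tag (bit 6 and up), checked later */
        uint8_t  data[64];    /* the cache line */
    };

    static struct l1d_entry l1d[SETS][WAYS];

    /* Stand-in hash; see the next quote for the real shape. */
    static uint8_t utag_hash(uint64_t linear) {
        return (uint8_t)((linear >> 12) ^ (linear >> 20));
    }

    /* Way prediction: select a set with linear bits 11:6, then find the
       single way whose stored utag matches the hash of the load's linear
       address. Returns the predicted way, or -1 for a predicted miss.
       The physical address is not needed at any point here. */
    int predict_way(uint64_t linear) {
        unsigned set = (linear >> 6) & (SETS - 1);
        uint8_t  h   = utag_hash(linear);
        for (int w = 0; w < WAYS; w++)
            if (l1d[set][w].utag_valid && l1d[set][w].utag == h)
                return w;  /* at most one way can match, by maintenance */
        return -1;
    }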
The utag is a hash of the load's linear address.
The hash function was reverse-engineered, for a number of microarchitectures, in Section 3 of the paper Take A Way: Exploring the Security Implications of AMD’s Cache Way Predictors. Basically, certain bits of the linear address at positions 27:12 are XORed with each other to produce an 8-bit value, which is the utag. A good hash function should: (1) minimize the number of linear address pairs that map to the same utag, (2) minimize the size of the utag, and (3) have a latency no larger than the utag memory access latency.
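The general shape of such a function is just a parity of selected bit pairs. A sketch in C, with made-up pairs (the real per-microarchitecture pairings are tabulated in the paper, and I have not copied them here):

    #include <stdint.h>

    /* Each of the 8 utag bits is the XOR of two linear-address bits in
       the 27:12 range. These pairs are placeholders for illustration;
       see Section 3 of "Take A Way" for the real functions. */
    static const int pairs[8][2] = {
        {12, 27}, {13, 26}, {14, 25}, {15, 24},
        {16, 23}, {17, 22}, {18, 21}, {19, 20},
    };

    uint8_t utag_hash(uint64_t linear) {
        uint8_t h = 0;
        for (int i = 0; i < 8; i++)
            h |= (uint8_t)((((linear >> pairs[i][0]) ^
                             (linear >> pairs[i][1])) & 1) << i);
        return h;
    }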
This linear address based lookup enables a very accurate prediction of in which way the cacheline is located prior to a read of the cache data. This allows a load to read just a single cache way, instead of all 8. This saves power and reduces bank conflicts.
Besides the utag memory and associated logic, the L1D also includes a tag memory and a data memory, all of which have the same organization. The tag memory stores physical tags (bit 6 up to the highest bit of the physical address). The data memory stores cache lines. In case of a hit in the utag, the way predictor reads only one entry in the corresponding way of the tag memory and data memory. The size of a physical address is more than 35 bits on modern x86 processors, so the size of a physical tag is more than 29 bits. This is more than 3x the size of a utag. Without way prediction, in a cache with more than one cache way, multiple tags would have to be read and compared in parallel. In an 8-way cache, reading and comparing 1 tag consumes much less energy than reading and comparing 8 tags.
In a cache where each way can be activated separately, each cache entry has its own wordline, which is shorter than a wordline shared across multiple cache ways. Due to signal propagation delays, reading a single way takes less time than reading all 8 ways. In a cache where all ways are accessed in parallel there is no way-prediction delay, but linear address translation is on the critical path of the load latency. With way prediction, the data from the predicted entry can be speculatively forwarded to dependent uops. This can provide a significant load latency advantage, especially since linear address translation latency can vary due to the multi-level design of the MMU, even in the typical case of an MMU hit. The downside is that it introduces a new reason for replays: on a misprediction, tens or even hundreds of uops may need to be replayed. I don't know whether AMD actually forwards the requested data before validating the prediction, but it's possible even though not mentioned in the manual.
Reduction of bank conflicts is another advantage of way prediction, as mentioned in the manual. This implies that different ways are placed in different banks. Section 2.6.2.1 says that bits 5:2 of the address, the size of the access, and the cache way number determine the banks to be accessed. This suggests there are 16*8 = 128 banks, one bank for each 4-byte chunk in each way. Bits 5:2 are obtained from the linear address of the load, the size of the load is obtained from the load uop, and the way number is obtained from the way predictor. Section 2.6.2 says that the L1D supports two 16-byte loads and one 16-byte store in the same cycle. This suggests that each bank has a single 16-byte read-write port. Each of the 128 bank ports is connected through an interconnect to each of the 3 ports of the L1D's data memory. One of the 3 ports is connected to the store buffer and the other two are connected to the load buffer, possibly with intermediary logic for efficiently handling cross-line loads (a single load uop but two load requests whose results are merged), overlapping loads (to avoid bank conflicts), and loads that cross bank boundaries.
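Under that reading, bank selection is simple arithmetic. A sketch of the inferred mapping (the 16x8 layout is my inference from the manual's wording, not a stated number):

    #include <stdint.h>
    #include <stdio.h>

    /* Inferred layout: 128 banks = 8 ways x 16 four-byte chunks per line.
       bank = way * 16 + chunk, where chunk = address bits 5:2. */
    unsigned bank_of(uint64_t linear, unsigned way) {
        unsigned chunk = (linear >> 2) & 0xf;  /* bits 5:2 */
        return way * 16 + chunk;
    }

    int main(void) {
        /* Two same-cycle loads conflict only if they map to the same bank:
           same predicted way AND same bits 5:2. A 16-byte load occupies
           4 consecutive banks within its way. */
        printf("%u vs %u (conflict)\n",    bank_of(0x1000, 3), bank_of(0x2000, 3));
        printf("%u vs %u (no conflict)\n", bank_of(0x1000, 3), bank_of(0x1004, 3));
        return 0;
    }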
The fact that way prediction requires accessing only a single way in the tag memory and the data memory of the L1D allows reducing or completely eliminating the need (depending on how snoops are handled) to make the tag and data memories truly multiported (which is the approach Intel followed in Haswell), while still achieving about the same throughput. Bank conflicts can still occur, though, when there are simultaneous accesses to the same way with identical address bits 5:2 but different utags. Way prediction does reduce bank conflicts because it doesn't require reading multiple entries (at least in the tag memory, but possibly also in the data memory) for each access, but it doesn't eliminate them completely.
That said, the tag memory may require true multiporting to handle fill checks (see later), validation checks (see later), snooping, and "normal path" checks for non-load accesses. I think only load requests use the way predictor. Other types of requests are handled normally.
A highly accurate L1D hit/miss prediction can have other benefits too. If a load is predicted to miss in the L1D, the scheduler wakeup signal for dependent uops can be suppressed to avoid likely replays. In addition, the physical address, as soon as it's available, can be sent early to the L2 cache before fully resolving the prediction. I don't know if these optimizations are employed by AMD.
It is possible for the utag to be wrong in both directions: it can predict hit when the access will miss, and it can predict miss when the access could have hit. In either case, a fill request to the L2 cache is initiated and the utag is updated when L2 responds to the fill request.
On an OS that supports multiple linear address spaces or allows synonyms in the same address space, cache lines can only be identified uniquely using physical addresses. As mentioned earlier, when looking up a utag in the utag memory, there can either be one hit or zero hits. Consider first the hit case. This linear address-based lookup results in a speculative hit that still needs to be verified. Even if paging is disabled, a utag is still not a unique substitute for a full address. As soon as the physical address is provided by the MMU, the prediction can be validated by comparing the physical tag from the predicted way with the tag from the physical address of the access. One of the following cases can occur:

1. The physical tags match, so the speculative hit is a true hit. The load can be serviced, and nothing else needs to be done beyond perhaps updating the replacement state.
2. The physical tags don't match, and the target line doesn't exist in any other entry of the same set. Note that the target line cannot possibly be in another set, because all of the L1D memories use the same set indexing function. I'll discuss how this is handled later.
3. The physical tags don't match, but the target line does exist in another entry of the same set (tagged with a different utag). I'll discuss how this is handled later.
If no matching utag was found in the utag memory, there will be no physical tag to compare against because no way is predicted. One of the following cases can occur:

4. The line doesn't actually exist in the L1D, so the speculative miss is a true miss. The line has to be fetched from somewhere else.
5. The line actually exists in the same set, but it was filled under a different linear address and therefore carries a utag that doesn't match the hash of the load's linear address. I'll discuss how this is handled later.
(I'm making two simplifications here. First, the load request is assumed to be to cacheable memory. Second, on a speculative or true hit in the L1D, there are no detected errors in the data. I'm trying to stay focused on Section 2.6.2.2.)
Accessing the L2 is needed only in cases 2 and 4, not in cases 3 and 5. The only way to determine which case applies is to compare the physical tag of the load with the physical tags of all the lines present in the same set. This can be done either before or after accessing the L2. Either way, it has to be done to avoid the possibility of having multiple copies of the same line in the L1D. Doing the checks before accessing the L2 improves the latency in cases 3 and 5 but hurts it in cases 2 and 4; doing them after improves the latency in cases 2 and 4 but hurts it in cases 3 and 5. It's possible to both perform the checks and send a request to the L2 at the same time, but this may waste energy and L2 bandwidth in cases 3 and 5. It seems that AMD decided to do the checks after the line is fetched from the L2 (which is inclusive of the L1 caches).
When the line arrives from the L2, the L1D doesn't have to wait until the line is actually filled before responding with the requested data, so a higher fill latency is tolerable. The physical tags are now compared to determine which of the four cases has occurred. In case 4, the line is filled into the data memory, tag memory, and utag memory in the way chosen by the replacement policy. In case 2, the requested line replaces the existing line that happened to have the same utag, and the replacement policy is not engaged to choose a way. This happens even if there was a vacant entry in the same set, essentially reducing the effective capacity of the cache. In case 5, the utag can simply be overwritten. Case 3 is a little complicated because it involves an entry with a matching physical tag and a different entry with a matching utag: one of them has to be invalidated and the other has to be replaced. A vacant entry can also exist in this case and go unused.
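Putting the five cases and the fill behavior together, here is one way to express the logic in C. This is my reading of the two paragraphs above, not a documented algorithm; in particular, the resolution chosen for case 3 is just one possibility:

    #include <stdint.h>

    #define SETS 64
    #define WAYS 8

    struct l1d_entry {
        uint8_t  utag, utag_valid, line_valid;
        uint64_t ptag;
    };

    static struct l1d_entry l1d[SETS][WAYS];

    enum outcome {
        TRUE_HIT = 1,       /* case 1: predicted way, physical tags match  */
        MISS_UTAG_CLASH,    /* case 2: predicted hit, line nowhere in set  */
        HIT_OTHER_WAY,      /* case 3: predicted hit, line in another way  */
        TRUE_MISS,          /* case 4: predicted miss, line nowhere in set */
        HIT_NO_UTAG,        /* case 5: predicted miss, line is in the set  */
    };

    /* Classify the access once the MMU has produced the physical tag.
       pw is the way predictor's output, or -1 on a utag miss. */
    enum outcome validate(unsigned set, int pw, uint64_t ptag) {
        if (pw >= 0 && l1d[set][pw].line_valid && l1d[set][pw].ptag == ptag)
            return TRUE_HIT;                                  /* case 1 */
        for (int w = 0; w < WAYS; w++)
            if (w != pw && l1d[set][w].line_valid && l1d[set][w].ptag == ptag)
                return pw >= 0 ? HIT_OTHER_WAY : HIT_NO_UTAG; /* 3 or 5 */
        return pw >= 0 ? MISS_UTAG_CLASH : TRUE_MISS;         /* 2 or 4 */
    }

    static int way_with_utag(unsigned set, uint8_t u) {
        for (int w = 0; w < WAYS; w++)
            if (l1d[set][w].utag_valid && l1d[set][w].utag == u) return w;
        return -1;
    }

    static int way_with_ptag(unsigned set, uint64_t p) {
        for (int w = 0; w < WAYS; w++)
            if (l1d[set][w].line_valid && l1d[set][w].ptag == p) return w;
        return -1;
    }

    static int replacement_victim(unsigned set) { (void)set; return 0; } /* stub */

    static void install(unsigned set, int w, uint64_t p, uint8_t u) {
        l1d[set][w] = (struct l1d_entry){ u, 1, 1, p };
    }

    /* Resolve the access when the L2 response arrives; u is the hash of
       the load's linear address. */
    void on_l2_fill(unsigned set, enum outcome oc, uint64_t ptag, uint8_t u) {
        switch (oc) {
        case TRUE_MISS:        /* case 4: normal fill via replacement policy */
            install(set, replacement_victim(set), ptag, u);
            break;
        case MISS_UTAG_CLASH:  /* case 2: overwrite the way whose utag clashed,
                                  even if a vacant way exists in the set */
            install(set, way_with_utag(set, u), ptag, u);
            break;
        case HIT_NO_UTAG: {    /* case 5: line already present; just retag it */
            int w = way_with_ptag(set, ptag);
            l1d[set][w].utag = u;
            l1d[set][w].utag_valid = 1;
            break;
        }
        case HIT_OTHER_WAY: {  /* case 3, one possible resolution: invalidate
                                  the utag-clashing entry, retag the real one */
            int clash = way_with_utag(set, u);
            if (clash >= 0) l1d[set][clash].utag_valid = 0;
            int hit = way_with_ptag(set, ptag);
            l1d[set][hit].utag = u;
            l1d[set][hit].utag_valid = 1;
            break;
        }
        default: break;        /* case 1: true hit, nothing to do */
        }
    }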
Linear aliasing occurs when two different linear addresses are mapped to the same physical address. This can cause performance penalties for loads and stores to the aliased cachelines. A load to an address that is valid in the L1 DC but under a different linear alias will see an L1 DC miss, which requires an L2 cache request to be made. The latency will generally be no larger than that of an L2 cache hit. However, if multiple aliased loads or stores are in-flight simultaneously, they each may experience L1 DC misses as they update the utag with a particular linear address and remove another linear address from being able to access the cacheline.
This is how case 5 (and case 2 to a lesser extent) can occur. Linear aliasing can occur within the same linear address space and across different address spaces (context switching and hyperthreading effects come into play).
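To see how two different linear addresses can map to the same physical address within a single process (one of the asker's questions), just map the same memory object twice. A Linux-specific sketch (any two MAP_SHARED mappings of one file behave the same; error checks omitted):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* One physical page, two linear addresses. */
        int fd = memfd_create("alias", 0);   /* Linux-specific */
        ftruncate(fd, 4096);
        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        strcpy(a, "hello");
        printf("a=%p b=%p, *b = \"%s\"\n", (void *)a, (void *)b, b);
        /* Alternating loads through a and b would keep retagging the same
           L1D line's utag, producing the ping-ponging misses the quoted
           paragraph describes. */
        return 0;
    }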
It is also possible for two different linear addresses that are NOT aliased to the same physical address to conflict in the utag, if they have the same linear hash. At a given L1 DC index (11:6), only one cacheline with a given linear hash is accessible at any time; any cachelines with matching linear hashes are marked invalid in the utag and are not accessible.
This is how cases 2 and 3 can occur, and they're handled as discussed earlier. This part also tells us that the L1D uses the simple set indexing function: the set number is bits 11:6 of the address.
I think huge pages make cases 2 and 3 more likely to occur because more than half of the bits used by the utag hash function become part of the page offset rather than the page number. For example, with 2 MiB pages the page offset covers bits 20:0, so 9 of the 16 hashed bits (20:12) are offset bits shared by every mapping of the page. Physical memory shared between multiple OS processes makes case 5 more likely.