Search code examples
cachingintelcpu-architecturecpu-cachetlb

How does the VIPT to PIPT conversion work on L1->L2 eviction


This scenario came into my head and it seems a bit basic but I'll ask.

So there is a virtual index and physical tag in L1 but the set becomes full so it is evicted. How does the L1 controller get the full physical address from the virtual index and the physical tag in L1 so the line can be inserted into L2? I suppose it could search the TLB for the combination but that seems slow and also it may not be in the TLB at all. Perhaps the full physical address from the original TLB translation is stored in the L1 next to the cache line?

This also opens the wider question of how the PMH invalidates the L1 entry when it writes accessed bits to the PTEs and PDEs and so on. It is my understanding it interfaces with the L2 cache directly for physical addresses but when it writes accessed and modified bits, as well as sending an RFO if it needs to, it would have to reflect the change in the copy in the L1 if there is one, meaning it would have to know the virtual index of the physical address. In this case if the full physical address were also stored in the L1 then it offers a way for the L2 to be able to index it as well.


Solution

  • Yes, outer caches are (almost?) always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out the memory hierarchy.


    In Intel CPUs, the VIPT L1 caches have all the index bits from the offset-within-page part of the address, so virt=phys, avoiding any aliasing problems. It's basically PIPT but still being able to fetch data/tags from the set in parallel with the TLB lookup for the pagenumber bits to create an input for the tag comparator.

    The full physical address is known just from L1d index + tag, again because it behaves like a PIPT for everything except load latency.


    In the general case of virtually-indexed caches where some of the index bits do come from the page-number, that's a good question. Such systems do exist, and page-colouring is often used by the OS to avoid aliasing. (So they don't need to flush the cache on context switches.)

    Virtually indexed physically tagged cache Synonym has a diagram for one such VIPT L1d: the physical tag is extended a few bits to come all the way down to the page offset, overlapping the top index bit.

    Good observation that a write-back cache needs to be able to evict dirty lines long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere.

    Having the tag include all the physical address bits above the page offset (even if that overlaps some index bits) solves this problem.

    Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, being virtual isn't a problem.


    But I don't think a TLB check at eviction could solve the problem for the non-overlapping tag case: you don't have the full virtual address anymore, only the low bits of your page-number are virtual (from the index), the rest are physical (from the tag). So this isn't a valid input to the TLB.

    So besides being inefficient, there's also the equally important problem that it wouldn't work at all. :P Maybe there's some trick I don't know or something I'm missing, but I don't think even a special TLB indexed both ways (phys->virt and virt->phys) could work, because multiple mappings of the same physical page are allowed.


    I think real CPUs that have used VIVT caches have normally had them as write-through. I don't know the history well enough to say for sure or cite any examples. I don't see how they could be write-back, unless they stored two tags (physical and virtual) for every line.

    I think early RISC CPUs often had 8k direct-mapped caches.

    But first-gen classic 5-stage MIPS R2000 (using external SRAM for its L1) apparently had a PIPT write-back cache, if the diagram in these slides labeled MIPS R2000 is right, showing a 14-bit cache index taking some bits from the physical page number of the TLB result. But it still works with 2 cycle latency for loads (1 for address-generation in the EX stage, 1 for cache access in the MEM stage).

    Clock speeds were much lower in those days, and caches+TLBs have gotten larger. I guess back then a 32-bit binary adder in the ALU did have comparable latency to TLB + cache access, maybe not using as aggressive carry-lookahead or carry-select designs.

    A MIPS 4300i datasheet, (variant of MIPS 4200 used in Nintendo 64) shows what happens where/when in its 5-stage pipeline, with some things happening on the rising or falling edge of the clock, letting it divide some things up into half-clocks within a stage. (so e.g. forwarding can work from the first half of one stage to the 2nd half of another, e.g. for branch target -> instruction fetch, still without needing extra latching between half-stages.)

    Anyway, it shows DVA (data virtual address) calculation happening in EX: that's the register + imm16 from a lw $t0, 1234($t1). Then DTLB and DCR (data-cache read) happen in parallel in the first half of the Data Cache stage. (So this is a VIPT). DTC (Data Tag Check) and LA (load alignment e.g. shifting for LWL / LWR, or for LBU to extract a byte from a fetched word) happen in parallel in the 2nd half of the stage.

    So I still haven't found confirmation of a single-cycle (after address calculation) PIPT MIPS. But this is definite confirmation that single-cycle VIPT was a thing. From Wikipedia, we know that its D-cache was 8-kiB direct-mapped write-back.