Tags: caching, memory, cpu, processor, cpu-cache

Is there such thing as a semi-shared cache?


I'm doing a little research on the caching hierarchy and have come across the concept of shared and private caches. I can see examples where caches are either private to a specific core (at higher levels of the hierarchy) or shared amongst all of the cores.

Are there any examples of a cache being shared across a certain subset of cores at an intermediate hierarchy level, and if not, why? My impression is that this would act as a middle ground in the trade-off between latency and hit rate, although I'm unable to find an example of such a structure.


Solution

  • Sharing an intermediate level cache among multiple cores (but fewer than share the last level cache) is not a common design point. There are, however, a few designs that share the L2 cache among as many cores as share the L3 cache.

    POWER4 and POWER5 both shared L2 cache among two cores, with L3 also shared among two cores. Since L3 cache data was stored off-chip (tags were on-chip) and each chip only had two cores, this is more similar to just sharing last level cache. Total L2 capacity was strongly constrained by chip size and L3 (having off-chip data) had somewhat high latency, so sharing to increase effective capacity was more attractive than for more recent designs with on-chip L3.

    SPARC M7 is a more interesting example. M7 had a 256 KiB L2 data cache shared among two cores and an L2 instruction cache shared among four cores, with L3 shared among four cores (the documentation I have seen is not entirely clear on whether L3 is unified, but the evidence generally points to L3 being private to each cluster of four cores). Since the data L2 is shared among only two cores, this might count as sharing L2 among fewer cores than L3, even though the instruction L2 is shared among the same number of cores as L3.

    Since M7 cores are 8-way threaded (as well as being only two-wide, out-of-order), L2 latency is less important: both thread-level parallelism and instruction-level parallelism extracted by out-of-order execution can hide latency, and a narrower core loses less execution potential to a given number of stall cycles. Since the processor targets commercial workloads with high thread-level parallelism and low instruction-level parallelism, increasing the core and thread count was a primary goal; sharing L2 caches can exploit common instruction and data use (the former is especially significant, but data sharing is not rare), reducing the total capacity required and leaving room for more cores.

    SPARC M8 was similar, but the L2 data cache was made private and the issue width was doubled to four-wide. The increase in issue width increases the importance of L2 latency, especially with modestly sized (16 KiB) L1 caches. The instruction cache is somewhat more latency-tolerant given the ability to fetch ahead in an instruction stream.

    Some considerations of the tradeoffs of intermediate level cache sharing

    Increasing the size of an L2 cache via sharing would reduce the capacity miss rate when capacity demand is imbalanced (not only when one core is inactive but even when different phases of the same program are active on different cores), but sharing L2 among multiple cores increases conflict misses. Increasing associativity can eliminate this effect at the cost of higher energy per access.
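    As a rough illustration of the conflict effect, here is a minimal sketch (the 256 KiB / 8-way / 64-byte geometry and the address streams are assumptions chosen for illustration, not taken from any real design): either core's hot lines alone fit within a set's associativity, but the combined working set does not.

    ```c
    /* Assumed geometry: 256 KiB shared L2, 8-way, 64-byte lines => 512 sets. */
    #include <stdio.h>

    #define LINE 64
    #define WAYS 8
    #define SETS 512                      /* 256 KiB / (8 ways * 64 B) */

    static unsigned set_index(unsigned long addr)
    {
        return (addr / LINE) % SETS;      /* middle address bits pick the set */
    }

    int main(void)
    {
        unsigned per_set[SETS] = {0};
        /* Two cores, each touching 6 hot lines at a 32 KiB stride, so all of
         * a core's lines map to the same set (6 <= 8 ways: fine alone). */
        for (int core = 0; core < 2; core++)
            for (int i = 0; i < 6; i++) {
                unsigned long addr = core * 0x100000UL + i * (SETS * LINE);
                per_set[set_index(addr)]++;
            }
        for (unsigned s = 0; s < SETS; s++)
            if (per_set[s] > WAYS)
                printf("set %u holds %u lines > %u ways: conflict misses\n",
                       s, per_set[s], WAYS);
        return 0;
    }
    ```

    Doubling associativity to 16 ways would absorb the combined 12 lines, at the cost of higher energy per access.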

    When two cores access the same memory locations within a shortish period of time, a shared cache can increase effective capacity by reducing replication, as well as potentially improving replacement decisions and providing limited prefetch. Sharing can also reduce cache block ping-pong if the writer and reader share an L2 cache; however, explicitly exploiting this increases the complexity of software core allocation. If sharing of a frequently written value is unavoidably common, even a random reduction in ping-ponging may be attractive, but the benefit diminishes rapidly as the number of cores involved increases.
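    From the software side, ping-ponging is easy to provoke with a false-sharing microbenchmark like the sketch below (a hypothetical example, assuming 64-byte lines, POSIX threads, and that the OS schedules the two threads on different cores; compile with cc -O2 -pthread). On a design where the two cores shared an L2, the contended line would bounce within that L2 rather than through the L3 or interconnect.

    ```c
    #include <pthread.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <time.h>

    #define ITERS 50000000UL

    /* Each counter on its own 64-byte line: no coherence traffic. */
    struct separate { _Alignas(64) volatile uint64_t a;
                      _Alignas(64) volatile uint64_t b; };
    /* Both counters within one 64-byte line: the line ping-pongs. */
    struct together { _Alignas(64) volatile uint64_t a;
                      volatile uint64_t b; };

    static void *bump(void *p)
    {
        volatile uint64_t *c = p;
        for (uint64_t i = 0; i < ITERS; i++)
            ++*c;
        return NULL;
    }

    static double run_pair(volatile uint64_t *x, volatile uint64_t *y)
    {
        struct timespec t0, t1;
        pthread_t ta, tb;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&ta, NULL, bump, (void *)x);
        pthread_create(&tb, NULL, bump, (void *)y);
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        static struct separate sep;
        static struct together tog;
        printf("separate lines: %.3f s\n", run_pair(&sep.a, &sep.b));
        printf("same line:      %.3f s\n", run_pair(&tog.a, &tog.b));
        return 0;
    }
    ```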

    When L2 is an intermediate level cache, access latency has significant importance since capacity misses from a smaller L2 will generally hit in L3. Doubling the capacity will increase access latency by more than 40% (latency is roughly proportional to the square root of capacity). Arbitration among multiple requesters also tends to increase latency. (A non-uniform cache architecture, where different cache blocks have different latencies, can compensate for this. E.g., in the context of sharing among two cores, a quarter of the capacity could be located closest to each core and the remaining half at an intermediate distance from both cores. However, NUCA introduces complexity in allocation.)
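    For reference, the square-root rule of thumb works out as:

    ```latex
    t_{\text{access}} \propto \sqrt{C}
    \quad\Longrightarrow\quad
    \frac{t_{\text{access}}(2C)}{t_{\text{access}}(C)} = \sqrt{2} \approx 1.41
    ```

    i.e., roughly a 41% latency increase for a doubling of capacity, consistent with the "more than 40%" figure above.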

    While increasing L2 capacity would use area that could otherwise be used by L3 cache (or more cores or other features), the size of L3 slices is typically so much larger than L2 capacity that this effect is not a primary consideration.

    Sharing L2 among two cores also means that the provided bandwidth must be suitable for two highly active cores. While banking can be used to provide this (and the extra bandwidth might be exploitable by a single active core), such increased bandwidth is not entirely free.
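    A sketch of how banking provides that bandwidth (the 4-bank, 64-byte-line interleaving below is an assumption for illustration): concurrent accesses from the two cores can be serviced in the same cycle only when their lines select different banks.

    ```c
    #include <stdio.h>

    #define LINE  64
    #define BANKS 4

    /* Low line-address bits pick the bank (assumed interleaving). */
    static unsigned bank(unsigned long addr)
    {
        return (addr / LINE) % BANKS;
    }

    static void check(unsigned long a, unsigned long b)
    {
        printf("0x%lx -> bank %u, 0x%lx -> bank %u: %s\n",
               a, bank(a), b, bank(b),
               bank(a) == bank(b) ? "bank conflict, serialized"
                                  : "serviced in parallel");
    }

    int main(void)
    {
        check(0x1000, 0x2040);   /* banks 0 and 1: parallel  */
        check(0x1000, 0x2000);   /* both bank 0: serialized  */
        return 0;
    }
    ```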

    Sharing L2 would also motivate increasing the complexity of cache allocation and replacement. One would prefer to avoid one core wasting capacity (or even associativity). Such moderating mechanisms are sometimes provided for last level cache (e.g., Intel's Cache Allocation Technology), so this is not a hard barrier. Some of the moderating mechanisms could also facilitate better replacement in general, and L2 mechanisms could exploit metadata associated with L3 cache (reducing the tagging overhead for metadata tracking) to adjust behavior.
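    One simple moderating mechanism is way partitioning in the spirit of Intel's Cache Allocation Technology: each core gets a bitmask of ways it may fill into, so victim selection is confined to that core's ways and one core cannot evict an entire set. The sketch below illustrates the idea; the structure is hypothetical, not Intel's implementation.

    ```c
    #include <stdio.h>
    #include <stdint.h>

    #define WAYS 8

    /* age[w]: larger = older. Pick the LRU way among the ways enabled
     * in the requesting core's capacity bitmask. */
    static int pick_victim(const uint8_t age[WAYS], uint8_t way_mask)
    {
        int victim = -1, oldest = -1;
        for (int w = 0; w < WAYS; w++)
            if (((way_mask >> w) & 1) && age[w] > oldest) {
                oldest = age[w];
                victim = w;
            }
        return victim;
    }

    int main(void)
    {
        uint8_t age[WAYS] = {3, 7, 1, 0, 6, 2, 5, 4};
        /* Hypothetical masks: core 0 owns ways 0-3, core 1 owns ways 4-7. */
        printf("core 0 evicts way %d\n", pick_victim(age, 0x0F)); /* -> 1 */
        printf("core 1 evicts way %d\n", pick_victim(age, 0xF0)); /* -> 4 */
        return 0;
    }
    ```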

    Sharing L2 cache also introduces complexity with respect to frequency adjustment. If the cores can run at different frequencies, the interface between each core and the shared L2 becomes more complex, increasing access latency. (In theory, a NUCA design like that mentioned above could have a small, close portion running at the local frequency and only pay the clock-boundary-crossing penalty when accessing the more distant portion.)

    Power gating is also simplified when L2 cache is dedicated to a single core. Rather than having three power domains (two cores and L2), a private L2 can be turned off with its core so only two power domains are needed. (Note that adding power domains is not extremely expensive and has been proposed for reducing power by dynamically reducing cache capacity.)

    A shared L2 cache can also provide a convenient merging point for the on-chip network, reducing the number of nodes in the broader network. (This merging could alternatively be done behind the L2 cache, providing lower latency and potentially higher bandwidth communication between two cores while also providing isolation.)

    Conclusion

    Fundamentally, sharing increases utilization (which is good for throughput, roughly speaking efficiency, but bad for latency, i.e., local performance) while hindering optimization through specialization. For L2 caches with a backing L3 cache, the specialization benefit (lower latency) tends to outweigh the utilization benefit for general designs (which generally trade throughput and efficiency for lower latency). The on-chip L3 cache reduces the cost of L2 capacity misses, so a higher L2 miss rate with a faster L2 hit time can reduce average memory access time.
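    That tradeoff can be made concrete with the standard average-memory-access-time formula, AMAT = hit time + miss rate × miss penalty. The latencies and miss rates below are purely illustrative assumptions, not measurements from any design.

    ```c
    #include <stdio.h>

    /* Standard AMAT formula; all times in cycles. */
    static double amat(double hit, double miss_rate, double miss_penalty)
    {
        return hit + miss_rate * miss_penalty;
    }

    int main(void)
    {
        double l3_hit = 40.0;  /* assumed latency of the backing L3 */
        /* Smaller private L2: faster hit, more misses (illustrative). */
        printf("private L2 AMAT: %.1f cycles\n", amat(12.0, 0.20, l3_hit));
        /* Larger shared L2: ~sqrt(2) slower hit plus arbitration,
         * fewer misses (illustrative). */
        printf("shared  L2 AMAT: %.1f cycles\n", amat(17.0, 0.15, l3_hit));
        return 0;
    }
    ```

    With these made-up numbers the private L2 wins (20 vs. 23 cycles) despite its higher miss rate, because the on-chip L3 keeps the miss penalty modest; this is the sense in which a faster hit time can outweigh a higher miss rate.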

    At the cost of design complexity and some overheads, sharing can be made more flexible, or the costs of sharing can be reduced. Increasing complexity increases development risk and marketing risk (not just time to market: feature complexity increases the difficulty of the buyer's choice, and marketing simplifications can seem deceptive). For L2 caches, the costs of more nuanced sharing seem generally not to have been considered worth the potential benefits.