caching, memory-management, cpu-cache, nehalem

Number of banks in Nehalem L2 cache


I was studying the access time for different cache configurations when I stumbled on a term in the CACTI interface: "Number of Banks".

The number of banks is the number of interleaved modules in a cache; more banks increase the bandwidth of the cache and the number of parallel accesses it can serve.
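To make the interleaving idea concrete, here is a minimal sketch of how consecutive cache lines could be spread across banks. The names `bank_of` and `conflicts`, the 64-byte line size, and the bank count of 4 are assumptions chosen for illustration, not figures for any real Nehalem cache.

```python
LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_BANKS = 4    # hypothetical bank count, not a Nehalem figure


def bank_of(address, line_size=LINE_SIZE, num_banks=NUM_BANKS):
    """Return the bank holding the cache line that contains `address`.

    Consecutive lines go to consecutive banks (low-order interleaving).
    """
    return (address // line_size) % num_banks


def conflicts(addresses):
    """Group same-cycle accesses by bank and keep banks hit more than once."""
    by_bank = {}
    for address in addresses:
        by_bank.setdefault(bank_of(address), []).append(address)
    return {bank: hits for bank, hits in by_bank.items() if len(hits) > 1}
```

With these assumed parameters, four accesses to four consecutive lines land in four different banks and could proceed in parallel, while two accesses that are 256 bytes apart collide in the same bank and would have to be serialized.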

In this context, I wanted to find the number of banks in the caches of the Nehalem architecture. I searched for this but did not find anything useful.

My reasoning here is:

  1. The L1 data and instruction caches must each have a single bank. The access granularity is a word here.
  2. The L2 cache serves the misses of the L1 data and instruction caches. Hence it must have 2 banks.
  3. The L3 cache is usually shared across all the cores in the system, and hence it must have a large number of banks (32).

Is my intuition correct? Also, does the number of banks change the way the data/program is structured? (Ideally it should not, but still...)


Solution

  • The overview graphic in the Wikipedia article shows that Nehalem (the first CPU branded as "Core i7") has 256 KByte of L2 cache per core.

    I don't get what you mean by the word "bank" here. Nehalem's L2 cache is 8-way associative with 64 bytes (512 bits) per cache line.

    That means data moves between memory and the caches in units of 64 bytes, not in units of a single word. An individual load or store on a 64-bit architecture typically touches at most 8 bytes (the size of a virtual address), so one cache line holds eight such values and accesses to neighbouring addresses hit the same line. (Other line sizes make sense, too, depending on the application, such as larger lines for the data caches of vector processing units.)

    x-way associativity determines the relationship between a memory address and the places where the information at that address can be stored inside the cache. The term "8-way associativity" refers to the fact that data stored at a certain memory address can be held in any one of 8 different cache lines. Caches have an address comparison mechanism to select the matching entry inside one way, and some replacement strategy to decide which of the x ways is to be used, possibly evicting a previously valid value.
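    The mapping from an address to its candidate cache lines can be sketched as follows. The function names are made up for this example, and the geometry (256 KByte, 8 ways, 64-byte lines) is an assumption of the sketch; the code only works for power-of-two sizes.

```python
def cache_geometry(capacity_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache.

    Assumes all sizes are powers of two.
    """
    sets = capacity_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(address, capacity_bytes=256 * 1024, ways=8, line_bytes=64):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets, offset_bits, index_bits = cache_geometry(capacity_bytes, ways, line_bytes)
    offset = address & (line_bytes - 1)               # position inside the line
    set_index = (address >> offset_bits) & (sets - 1)  # which set to look in
    tag = address >> (offset_bits + index_bits)        # compared against all ways
    return tag, set_index, offset
```

    Under these assumptions the cache has 512 sets. Two addresses that are 512 * 64 = 32768 bytes apart share a set index and differ only in the tag, so they compete for the same 8 ways; the replacement strategy decides which line is evicted once all 8 are occupied.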

    Your use of the term "bank" probably refers to one of the 8 ways of this associativity. Thus the answer to your question probably is "8". And again, there is one L2 cache per core, and each has that structure.

    Your assumption about simultaneous access is a valid one as well; it is documented, e.g., for ARM's Cortex-A15. However, whether and how those ways or banks of the cache can be accessed independently is anyone's guess. The Wikipedia diagram shows a 256-bit bus between the L1 data cache and the L2 cache. This could imply that it is possible to serve four 64-bit accesses independently (4*64=256), but more likely only one memory load/store is actually transferred at any given time, and the slower L2 cache just feeds 256 bits at once to the faster L1 cache in what one could call a burst.
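    The bus arithmetic above can be checked with a small sketch. The helper names are hypothetical, the 256-bit width is taken from the Wikipedia diagram mentioned above, and the 512-bit payload in the note below is only an example value.

```python
def words_per_beat(word_bits=64, bus_bits=256):
    """How many words of `word_bits` fit into one transfer on the bus."""
    return bus_bits // word_bits


def beats_needed(payload_bits, bus_bits=256):
    """Back-to-back bus transfers needed to move `payload_bits` (ceiling division)."""
    return -(-payload_bits // bus_bits)
```

    With these numbers, one transfer carries four 64-bit words, and a 512-bit payload would need two consecutive transfers.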

    This assumption is supported by the fact that the System Architecture Manual, which can be found on Intel's page, lists the later Sandy Bridge improvements in chapter 2.2.6, emphasizing "Internal bandwidth of two loads and one store each cycle.". Thus CPUs before Sandy Bridge should support a smaller number of concurrent loads/stores.

    Note that there is a difference between "in flight" loads/stores and data actually being transmitted. "In flight" operations are those currently being executed. For a load, that can entail waiting for memory to yield the data after all caches have reported misses. So you can have many loads going on in parallel, yet the data bus between any two caches can still be used by only one of them at any given time. The Sandy Bridge improvement quoted above widens that data path so that two loads and one store can transmit data at the same time, which Nehalem (one "tock", i.e. one architecture, before Sandy Bridge) could not do.
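    A toy model may help separate the two notions. It is a sketch only: `delivery_cycles` is a made-up helper, the cycle numbers are arbitrary, and real hardware arbitration is far more involved. Each in-flight load has a cycle at which its data becomes available, and the bus can move a limited number of results per cycle.

```python
def delivery_cycles(ready_cycles, transfers_per_cycle=1):
    """Return the cycle in which each in-flight load gets its data over the bus.

    `ready_cycles` lists, per load, the cycle at which the data is available.
    Loads are served in order of readiness; a full cycle pushes a load to the
    next free one.
    """
    used = {}        # cycle -> number of transfers already scheduled
    delivered = []
    for ready in sorted(ready_cycles):
        cycle = ready
        while used.get(cycle, 0) >= transfers_per_cycle:
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        delivered.append(cycle)
    return delivered
```

    Four loads whose data is all ready in cycle 10 are delivered in cycles 10, 11, 12, and 13 when the bus moves one result per cycle, and in cycles 10, 10, 11, and 11 when it can move two per cycle. The number of loads in flight is the same in both cases; only the transfer capacity differs.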

    Your intuition is not correct on some accounts:

    1. Hyper-threading and multithreading in general allow a CPU to execute more than one instruction per cycle (Nehalem, chapter 2.2.5: "Provides two hardware threads (logical processors) per core. Takes advantage of 4-wide execution engine"). Thus it makes sense to support multiple concurrent loads/stores to an L1 cache.
    2. The L2 cache serves both L1 data and L1 instruction cache - you're correct on that part. For the reason in (1) it may make sense to support more than 2 simultaneous operations.
    3. In principle you could scale that number up for the L3 cache, but in practice that does not make sense. I don't know where you got the number 32 from; maybe it is just a guess. For every additional access port ("bank" in your terminology) you need address decoders and tag arrays (for handling address comparisons to cache lines, the replacement strategy, and any cache data flags such as the dirty bit). So every access port costs additional transistors and thus area and power on silicon. Every port that exists also slows down cache access, even if it is not in use. (Details are out of scope for this answer.) So this is a delicate design decision, and 32 is generally way too high. Usually, for any kind of memory inside a CPU, the numbers range from 1 to 6-8 read ports and from 1 to 2-4 write ports. There may be exceptions, of course.

    Regarding your point about software optimizations: worry only if you are a low-level hardware/firmware developer. Otherwise just follow high-level ideas: if you can, keep the working set of your innermost loop of intense operations small enough to fit into an L3 cache. Do not start more compute-intensive threads working on local data than you have cores. If you do start to worry about such speed implications, start compiling/optimizing your code with the matching CPU switches, and control the other tasks running on the machine (even infrastructure services).
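    As a rough sketch of those two rules of thumb, the following helper picks a worker count no larger than the core count and a chunk size whose data fits into each worker's share of a shared last-level cache. `plan_work` is a hypothetical helper, and the default of 8 MiB is an assumed L3 size; check the actual figure for your CPU.

```python
import os


def plan_work(total_items, item_bytes, cache_bytes=8 * 1024 * 1024, cores=None):
    """Return (workers, items per chunk) for a cache-friendly work split.

    `cache_bytes` is an assumed last-level cache size shared by all workers.
    """
    cores = cores or os.cpu_count() or 1
    # Never start more compute-bound workers than there are cores (or items).
    workers = max(1, min(cores, total_items))
    # Give each worker an equal share of the shared cache.
    per_worker_bytes = cache_bytes // workers
    chunk_items = max(1, per_worker_bytes // item_bytes)
    return workers, chunk_items
```

    For one million 8-byte items on a 4-core machine with the assumed 8 MiB cache, this yields 4 workers processing chunks of 262144 items (2 MiB) each.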

    In summary:

    • Nehalem's L2 cache is 8-way associative.
    • It supports fewer than 2 simultaneous loads and 1 store per cycle, probably only one operation. But each load/store can transmit up to 256 bits at a time to/from the L1 data cache.
    • The number of simultaneous load/store operations does not scale up to 32 for the L3 cache due to physical design restrictions (timing/area/power).
    • You should generally not worry about these details too much in your applications, unless you know for sure that you have to (e.g. in high-performance computing).