For the question, I will use this table as example:
But the memory hierarchy of this processor is not relevant for this question!
My question is whether the latency value for each cache level includes the access time of the previous levels or not. I mean, if we assume we only access L2 after an L1 miss (and only access L3 after an L2 miss), then looking at my example, does an access that misses in L1 and L2 but hits in L3 cost ~21 cycles, or ~(4+12+21) cycles?
And, if the answer is that the latency value includes the previous cache levels' accesses, does the RAM access latency value include them too?
As I said, ignore the exact numbers for this processor; please treat the question in a general way.
I have seen a lot of latency tables like this one, and I've never known how to interpret them correctly because of this doubt.
Normally (including in this case), latency is given as the total latency for an access that stops at that level of the memory hierarchy, after missing in the inner levels. So in your example, an access that misses in L1 and L2 but hits in L3 costs ~21 cycles total, not ~(4+12+21) cycles.
That's what you can actually measure (e.g. with a pointer-chasing linked list whose working set doesn't fit in L1d, or doesn't fit in L2, and so on), and it's the easiest way to think about the numbers.
Note that L3 and memory latency depend on contention from other cores, and also on how big a ring bus or mesh the request has to traverse to get from this core to a slice of L3. See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? For example, an Intel quad-core "client" chip has better L3 and memory latency (and single-core bandwidth) than a big Xeon with the same cores.
OTOH, it is somewhat reasonable to give fairly hard numbers for the L1d and L2 caches, because they're per-core private. But beware that L1d load-use latency isn't always 4 cycles: that applies only when you're pointer-chasing (dereferencing a pointer you just loaded) and using a simple addressing mode. See Is there a penalty when base+offset is in a different page than the base?