Tags: java, multithreading, visibility, shared-variable

Why are shared variables cached in CPU caches?


I'm trying to understand the Java Memory Model, but there is one point about CPU caches that I haven't been able to grasp.

As far as I know, in the JVM we have the following locations for storing local and shared variables:

local variables -- on the thread's stack

shared variables -- in main memory, with each CPU cache holding a copy
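
To make concrete what I mean by "local" and "shared", here is a minimal sketch (the class is hypothetical, just for illustration):

```java
class Counter {
    private int total = 0; // shared: lives on the heap and is reachable from any thread holding this Counter

    void addBatch(int[] batch) {
        int sum = 0;                 // local: exists only on the calling thread's stack frame
        for (int value : batch) {
            sum += value;
        }
        total += sum;                // write to shared state - this is where the caching question arises
    }
}
```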

So my question is: why store local variables on the stack but cache shared variables in the CPU cache? Why not the other way around (supposing the CPU cache is too expensive to hold both): cache local variables in the CPU caches and just fetch shared variables from main memory? Is this part of the Java language design, or of the computer architecture?

Further: as simple as "CPU cache" sounds, what if several CPUs share one cache? And in systems with multi-level caches, which level of cache holds the copy of a shared variable? Furthermore, if more than one thread is running on the same CPU core, does that mean they share the same set of cached shared variables, and hence, even if a shared variable is not declared volatile, are writes to it still instantly visible to the other threads running on the same core?
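
For example, I am wondering whether a plain (non-volatile) flag like the one in this hypothetical sketch is guaranteed to become visible when both threads happen to run on the same core:

```java
class SameCoreVisibility {
    // Deliberately not volatile: this is exactly the case I am asking about.
    private static boolean stop = false;

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop) {
                // busy-wait: without volatile, is the write below guaranteed to be seen here
                // just because both threads might share the same core and its caches?
            }
            System.out.println("worker observed stop = true");
        });
        worker.start();

        Thread.sleep(100);
        stop = true;   // plain write from the main thread
        worker.join(); // may never return if the write is not made visible to the worker
    }
}
```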


Solution

  • "Local" and "shared" variables are meaningless outside the context of your code. They don't influence where or even if the state is cached. It's not even useful to think or reason in terms of where your state is stored; the entire reason the JMM exists is so that details like these, which vary from architecture to architecture are not exposed to the programmer. By relying on low-level hardware details, you are asking the wrong questions about the JMM. It's not useful to your application, it makes it fragile, easier to break, harder to reason with, and less portable.

    That said, in general you should assume that any piece of program state is eligible to be cached. What actually gets cached does not matter; the point is that anything and everything can be, whether primitive types, reference types, or state spread across several fields. Whatever instructions a thread runs (and those instructions vary by architecture too - beware!), it is ultimately the CPU that decides what is worth caching and what is not; it is impossible for programmers to control this themselves (although it is possible to influence where state variables may end up - see what false sharing is).
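
    To illustrate the false-sharing point: two counters updated by two different threads can still slow each other down if they happen to share a cache line. Below is a rough sketch of the usual padding trick, assuming a 64-byte cache line (a common, but not guaranteed, size); the JDK itself uses an internal @Contended annotation for the same purpose, but that is not a supported public API.

```java
// Two counters, each bumped by a different thread. If 'a' and 'b' end up on the
// same cache line, every write invalidates the other core's copy of that line
// (false sharing), even though the threads never touch the same field.
class SharedCounters {
    volatile long a;
    volatile long b;
}

// Common mitigation: pad the fields apart so they are likely to land on
// different cache lines. Field layout is ultimately up to the JVM, so this
// is a heuristic, not a guarantee.
class PaddedCounters {
    volatile long a;
    long p1, p2, p3, p4, p5, p6, p7; // padding, assuming ~64-byte lines
    volatile long b;
}
```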

    Again, we can make some broad generalizations about x86: actively used primitive values are probably kept in registers, because the execution units can work with them the fastest. Beyond that, anything goes. Primitives may be moved into the core-local L1/L2 caches, and it's just as possible that they are evicted quite quickly. The CPU might place state in the shared L3 if it anticipates a context switch, or it might not; a hardware expert would have to answer that.

    Ideally, state will be stored as close to the processing unit as possible (registers, then L1/L2/L3, then main memory), but that is up to the CPU to decide. It is impossible to reason about cache semantics at the Java level. Even with hyper-threading enabled (AMD's equivalent is simply called SMT), the fact that two hardware threads share a core's resources gives you no visibility guarantees; and even if it did, recall that visibility is not the only problem with shared state. Because the processor pipelines and reorders instructions, you still need the appropriate instructions to ensure correct ordering (even after you account for the CPU's read/write buffering), whether that is hwsync, the appropriate fences, or something else.
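
    At the Java level, that ordering concern is expressed through happens-before rather than through hardware fences. A minimal sketch (the Publisher class is hypothetical) of how a volatile flag orders the surrounding plain accesses, with the JIT emitting whatever barrier the target CPU actually requires:

```java
class Publisher {
    private int payload;             // plain field
    private volatile boolean ready;  // volatile: provides both visibility and ordering

    void publish(int value) {
        payload = value;   // ordered before the volatile write below (release semantics)
        ready = true;      // volatile write: the JIT inserts whatever barriers the CPU needs
    }

    Integer tryRead() {
        if (!ready) {      // volatile read (acquire semantics)
            return null;
        }
        return payload;    // guaranteed to observe the value written before ready = true
    }
}
```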

    Again, reasoning about the properties of the cache is not useful, both because the JMM handles that for you and because where, when, and what is cached is indeterminable. Even if you could answer those where/when/what questions, you STILL could not reason about data visibility: all caches treat cached data the same way, and you would have to account for the processor moving cache lines between the ME(O)SI states, instruction ordering, load/store buffering, write-back/write-through policies, and so on. And that is before the problems that can occur at the OS and JVM level. Luckily, the JDK gives you basic tools such as volatile, final, and the atomic classes, which work consistently across all platforms and produce code that is predictable and easier to reason about.
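
    As a closing sketch (class and field names are hypothetical), this is what leaning on those portable tools, rather than on cache reasoning, usually looks like:

```java
import java.util.concurrent.atomic.AtomicLong;

class Stats {
    private final AtomicLong hits = new AtomicLong(); // atomic read-modify-write, safe across threads
    private volatile boolean shutdown;                // volatile flag: writes are visible to all readers

    void record() {
        hits.incrementAndGet(); // safe even when many threads call this concurrently
    }

    void stop() {
        shutdown = true;        // happens-before any subsequent read that observes true
    }

    boolean isStopped() {
        return shutdown;
    }

    long total() {
        return hits.get();
    }
}
```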