
Does the volatile keyword in Java really have anything to do with caches?


From what I've read, the "volatile" keyword in Java ensures that a thread always fetches the most up-to-date value of a particular variable, usually by reading/writing directly from/to main memory to avoid cache inconsistencies.

But why is this needed? To my knowledge, this is already handled at the hardware level. If I remember correctly from my systems architecture class, a processor core that updates a memory location sends an invalidation signal to the other processors' caches, forcing them to re-fetch those lines when the time comes. Or, the other way around: if a processor fetches memory, it forces dirty (cached but not yet written back) lines in other caches to be flushed to memory first.

My only theory is that this actually has nothing to do with caches at all, despite all the explanations I've read. It has to do with the fact that data in the JVM can reside in two places: a thread's local stack and the heap. A Java thread may use its stack as a kind of cache. I'll buy that, but it would also mean that using volatile on data that resides on the heap is useless, since the heap is shared by all threads and abides by hardware-implemented coherence?

Eg:

public final int[] is = new int[10];

Accessing is's data should always yield the most up-to-date values, since the data resides on the heap. The pointer itself, however, might fall victim to the stack problem, but since it's final we don't have that problem.
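For instance, here is a small sketch of the kind of situation I mean (the wrapper class and the writer/reader split are hypothetical, just to make the scenario concrete):

class HeapVisibility {
    final int[] is = new int[10];

    void writer() {              // runs on thread A
        is[0] = 1;               // write to data that lives on the heap
    }

    void reader() {              // runs on thread B
        while (is[0] == 0) {     // is this update guaranteed to become visible?
            // spin
        }
    }
}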

Are my assumptions correct?

Edit: This is not a duplicate as far as I can tell. The alleged duplicate thread is one of those misleading answers claiming it has to do with cache coherence. My question is not what volatile is used for, nor how to use it. It's testing a theory and digging deeper.


Solution

  • I've done some research and come up with the following:

    A volatile variable is affected in two ways.

    Take this Java example:

    public int i = 0;
    public void increment(){
       i++;
    }
    

    Without volatile, the JIT will emit something like the following pseudo-instructions for the increment method:

    LOAD  R1,i-address
    ... arbitrary number of instructions, not involving R1
    ADDI  R1,1
    ... arbitrary number of instructions, not involving R1
    ... this is not guaranteed to happen, but probably will:
    STORE R1, i-address
    

    Why the arbitrary instructions? Because of optimization: the pipeline is stuffed with instructions not involving R1 to avoid pipeline stalls. In other words, you get out-of-order execution. Writing i back to memory will also be avoided if possible; if the optimizer can figure out that the store is unnecessary, it won't do it. It might miss the fact that i is accessed from another thread, though, in which case that thread will still see i as 0.
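    A hedged sketch of how that can bite in practice (the class and field names here are mine, not from the question): without volatile, nothing stops the optimizer from reordering the two stores below, or from letting the reader skip re-loading them altogether.

    class Reordering {
        int x = 0;
        boolean ready = false;       // plain field: no ordering or visibility guarantee

        void writer() {
            x = 42;
            ready = true;            // may be reordered before the store to x
        }

        void reader() {
            if (ready) {
                System.out.println(x);   // may legally print 0: nothing orders these accesses
            }
        }
    }

    Declaring ready as volatile forces its STORE to happen after the store to x, and forces the reader to do a fresh LOAD of it.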

    When we change i to volatile, we get:

    STEP 1

    LOAD  R1,i-address
    ADDI  R1,1
    STORE R1, i-address
    

    Volatile prevents out-of-order execution around these accesses; the JIT will not try to stuff the pipeline here to hide hazards, and it will never cache i locally, meaning in a register or a stack frame. It guarantees that every operation on i involves a LOAD and a STORE, in other words a fetch from and a write to memory. "Memory", however, does not mean main memory or RAM; it means the memory hierarchy. LOADs and STOREs are used for all variables, volatile or not, just not to the same extent, and how they are handled is up to the chip architects.
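    As a small illustration of what that buys you (this class is illustrative, not part of the original answer): because every read of a volatile field is a real LOAD, a thread spinning on it cannot keep re-using a stale copy held in a register.

    class Worker implements Runnable {
        // every read of running is a LOAD and every write a STORE,
        // so the loop below re-checks the current value on each iteration
        private volatile boolean running = true;

        public void requestStop() {
            running = false;
        }

        @Override
        public void run() {
            while (running) {
                // do work
            }
        }
    }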

    STEP 2

    LOAD  R1,i-address
    ADDI  R1,1
    LOCK STORE R1, i-address
    

    The LOCK prefix acts as a memory barrier, meaning that any other thread trying to read or write i's address has to wait until the store has completed. This ensures that the actual write-back of i is atomic.

    Note, though, that the Java statement "i++" is not atomic: things can still happen between the LOAD and the STORE. That's why you typically need explicit locks, which are themselves implemented with volatiles, to make operations on i truly atomic. Take this example:

    class TwoThreads {
        static volatile int i = 0;

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {          // both THREAD A and THREAD B run this
                for (int j = 0; j < 1000; j++)
                    i++;                     // LOAD, ADDI, STORE: not atomic
            };
            Thread a = new Thread(work);     // THREAD A
            Thread b = new Thread(work);     // THREAD B
            a.start(); b.start();
            a.join();  b.join();
            System.out.println(i);           // rarely 2000
        }
    }
    

    will produce unpredictable results on a multi-core processor, and needs to be fixed like this:

    private volatile int i = 0;

    public synchronized void incrementI() {
       i++;
    }
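
    An alternative, sketched here from the atomic-variables tutorial linked below rather than taken from the answer above, is to let java.util.concurrent.atomic perform the whole read-modify-write atomically, without an explicit lock:

    import java.util.concurrent.atomic.AtomicInteger;

    class AtomicCounter {
        private final AtomicInteger i = new AtomicInteger(0);

        public void incrementI() {
            i.incrementAndGet();   // atomic read-modify-write; on x86 this typically becomes a LOCK-prefixed instruction
        }

        public int get() {
            return i.get();
        }
    }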
    

    Sources:
    https://docs.oracle.com/javase/tutorial/essential/concurrency/atomic.html
    https://docs.oracle.com/cd/E19683-01/806-5222/codingpractices-1/index.html

    Conclusion: According to both Intel and AMD, cache consistency is managed by the hardware, so volatile has nothing to do with caches, and the claim that "volatiles are forced to live in main memory" is a myth. It does, however, probably cause additional cache invalidations indirectly, since STOREs are issued more frequently.

    I am open to the idea that volatile will cause a write-through on obscure architectures, though.