Can compiler sometimes cache variable declared as volatile

From what I know, the compiler never optimizes a variable that is declared as volatile. However, I have an array declared like this.

volatile long array[8];

And different threads read and write to it. An element of the array is only modified by one of the threads and read by any other thread. However, in certain situations I've noticed that even if I modify an element from a thread, the thread reading it does not notice the change. It keeps on reading the same old value, as if compiler has cached it somewhere. But compiler in principal should not cache a volatile variable, right? So how come this is happening.

NOTE: I am not using volatile for thread synchronization, so please stop giving me answers such as use a lock or an atomic variable. I know the difference between volatile, atomic variables and mutexes. Also note that the architecture is x86 which has proactive cache coherence. Also I read the variable for long enough after it is supposedly modified by the other thread. Even after a long time, the reading thread can't see the modified value.

Solution

But compiler in principal should not cache a volatile variable, right?

No, the compiler in principle must read/write the address of the variable each time you read/write the variable.

[Edit: At least, it must do so up to the point at which the the implementation believes that the value at that address is "observable". As Dietmar points out in his answer, an implementation might declare that normal memory "cannot be observed". This would come as a surprise to people using debuggers, mprotect, or other stuff outside the scope of the standard, but it could conform in principle.]

In C++03, which does not consider threads at all, it is up to the implementation to define what "accessing the address" means when running in a thread. Details like this are called the "memory model". Pthreads, for example, allows per-thread caching of the whole of memory, including volatile variables. IIRC, MSVC provides a guarantee that volatile variables of suitable size are atomic, and it will avoid caching (rather, it will flush as far as a single coherent cache for all cores). The reason it provides that guarantee is because it's reasonably cheap to do so on Intel -- Windows only really cares about Intel-based architectures, whereas Posix concerns itself with more exotic stuff.

C++11 defines a memory model for threading, and it says that this is a data race (i.e. that volatile does not ensure that a read in one thread is sequenced relative to a write in another thread). Two accesses can be sequenced in a particular order, sequenced in unspecified order (the standard might say "indeterminate order", I can't remember), or not sequenced at all. Not sequenced at all is bad -- if either of two unsequenced accesses is a write then behavior is undefined.

The key here is the implied "and then" in "I modify an element from a thread AND THEN the thread reading it does not notice the change". You're assuming that the operations are sequenced, but they're not. As far as the reading thread is concerned, unless you use some kind of synchronization the write in the other thread hasn't necessarily happened yet. And actually it's worse than that -- you might think from what I just wrote that it's only the order of operations that is unspecified, but actually the behavior of a program with a data race is undefined.