c#thread-safety memory-model memory-barriers mesi

Why is the standard C# event invocation pattern thread-safe without a memory barrier or cache invalidation? What about similar code?

In C#, this is the standard code for invoking an event in a thread-safe way:

var handler = SomethingHappened;
if(handler != null)
    handler(this, e);

Where, potentially on another thread, the compiler-generated add method uses Delegate.Combine to create a new multicast delegate instance, which it then sets on the compiler-generated field (using interlocked compare-exchange).

(Note: for the purposes of this question, we don't care about code that runs in the event subscribers. Assume that it's thread-safe and robust in the face of removal.)

In my own code, I want to do something similar, along these lines:

var localFoo = this.memberFoo;
if(localFoo != null)
    localFoo.Bar(localFoo.baz);

Where this.memberFoo could be set by another thread. (It's just one thread, so I don't think it needs to be interlocked - but maybe there's a side-effect here?)

(And, obviously, assume that Foo is "immutable enough" that we're not actively modifying it while it is in use on this thread.)

Now I understand the obvious reason that this is thread-safe: reads from reference fields are atomic. Copying to a local ensures we don't get two different values. (Apparently only guaranteed from .NET 2.0, but I assume it's safe in any sane .NET implementation?)

But what I don't understand is: What about the memory occupied by the object instance that is being referenced? Particularly in regards to cache coherency? If a "writer" thread does this on one CPU:

thing.memberFoo = new Foo(1234);

What guarantees that the memory where the new Foo is allocated doesn't happen to be in the cache of the CPU the "reader" is running on, with uninitialized values? What ensures that localFoo.baz (above) doesn't read garbage? (And how well guaranteed is this across platforms? On Mono? On ARM?)

And what if the newly created foo happens to come from a pool?

thing.memberFoo = FooPool.Get().Reset(1234);

This seems no different, from a memory perspective, to a fresh allocation - but maybe the .NET allocator does some magic to make the first case work?

My thinking, in asking this, is that a memory barrier would be required to ensure - not so much that memory accesses cannot be moved around, given the read is dependent - but as a signal to the CPU to flush any cache invalidations.

My source for this is Wikipedia, so make of that what you will.

(I might speculate that maybe the interlocked-compare-exchange on the writer thread invalidates the cache on the reader? Or maybe all reads cause invalidation? Or pointer dereferences cause invalidation? I'm particularly concerned how platform-specific these things sound.)

Update: Just to make it more explicit that the question is about CPU cache invalidation and what guarantees .NET provides (and how those guarantees might depend on CPU architecture):

Say we have a reference stored in field Q (a memory location).
On CPU A (writer) we initialize an object at memory location R, and write a reference to R into Q
On CPU B (reader), we dereference field Q, and get back memory location R
Then, on CPU B, we read a value from R

Assume the GC does not run at any point. Nothing else interesting happens.

Question: What prevents R from being in B's cache, from before A has modified it during initialisation, such that when B reads from R it gets stale values, in spite of it getting a fresh version of Q to know where R is in the first place?

(Alternate wording: what makes the modification to R visible to CPU B at or before the point that the change to Q is visible to CPU B.)

(And does this only apply to memory allocated with new, or to any memory?)+

Note: I've posted a self-answer here.

Solution

I think I have figured out what the answer is. But I'm not a hardware guy, so I'm open to being corrected by someone more familiar with how CPUs work.

The .NET 2.0 memory model guarantees:

Writes cannot move past other writes from the same thread.

This means that the writing CPU (A in the example), will never write a reference to an object into memory (to Q), until after it has written out contents of that object being constructed (to R). So far, so good. This cannot be re-ordered:

R = <data>
Q = &R

Let's consider the reading CPU (B). What is to stop it reading from R before it reads from Q?

On a sufficiently naïve CPU, one would expect it to be impossible to read from R without first reading from Q. We must first read Q to get the address of R. (Note: it is safe to assume that the C# compiler and JIT behave this way.)

But, if the reading CPU has a cache, couldn't it have stale memory for R in its cache, but receive the updated Q?

The answer seems to be no. For sane cache coherency protocols, invalidation is implemented as a queue (hence "invalidation queue"). So R will always be invalidated before Q is invalidated.

Apparently the only hardware where this is not the case is the DEC Alpha (according to Table 1, here). It is the only listed architecture where dependent reads can be re-ordered. (Further reading.)