Why can the MESI protocol result in a write that is followed by another write to the same address?

The MESI protocol is used with write-back caches. There are 2 cores on a single processor and, for simplicity, only L1 caches.

Address A was never used.
Core 1 initiates a write to address A. The data is saved in its cache and the line's state is set to M.
Core 2 initiates a write to address A, which results in a cache miss. Core 2 broadcasts an RWITM (Read With Intent To Modify). Core 1, which is snooping, blocks the RWITM, writes the value back to main memory, and sets the line's state in its cache to I. Core 2 then reissues the RWITM, stores the value for address A in main memory, and sets the state for address A in its cache.

Why does core 1 save anything to main memory, wasting time, when the value is almost immediately overwritten anyway?

My sources:

http://upload.wikimedia.org/wikipedia/commons/9/99/MESI_protocol_activity_diagram.png

https://www.cs.utexas.edu/~pingali/CS395T/2009fa/lectures/mesi.pdf

http://en.wikipedia.org/wiki/MESI_protocol

Solution

For two reasons:

Coherence is handled at full cache-line granularity. Nothing guarantees that both cores modified the same bytes, so core 2 needs to see core 1's update first, then merge it with the bytes it modified, and only then commit its store.
Core 2's write may be preempted while in flight, or something else could happen to it (it hasn't been committed yet; it is only waiting for the RWITM to complete). In that case you would lose core 1's data just because you trusted core 2. Never trust multi-core hardware to behave as you expect (and trust multi-threaded software even less).
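To make the first reason concrete, here is a small Python sketch. It assumes a 64-byte line and has core 1 write bytes 0..3 while core 2 writes bytes 8..11; the offsets, values, and the `store_bytes` helper are hypothetical. If core 2 merged its store into the stale memory copy, core 1's bytes would be lost; merging into the written-back line keeps both.

```python
LINE_SIZE = 64  # assumed cache-line size in bytes


def store_bytes(line: bytearray, offset: int, data: bytes) -> bytearray:
    """Return a copy of `line` with `data` written at `offset` (a partial store)."""
    new_line = bytearray(line)
    new_line[offset:offset + len(data)] = data
    return new_line


# Address A was never used, so memory holds a zeroed line.
memory_line = bytearray(LINE_SIZE)

# Core 1 writes bytes 0..3; its copy is in M, so memory is now stale.
core1_line = store_bytes(memory_line, 0, b"\x11" * 4)

# Wrong: core 2 merges its store (bytes 8..11) into the stale memory copy.
stale_merge = store_bytes(memory_line, 8, b"\x22" * 4)

# Right: core 1 writes back on the snoop hit, core 2 merges into that line.
memory_line = bytearray(core1_line)
correct_merge = store_bytes(memory_line, 8, b"\x22" * 4)

print(stale_merge[0:4].hex())     # 00000000 -> core 1's bytes lost
print(correct_merge[0:4].hex())   # 11111111 -> core 1's bytes kept
print(correct_merge[8:12].hex())  # 22222222 -> core 2's bytes present
```

The point is that a store only covers a few bytes, while ownership moves a whole line, so the line core 2 merges into must be the latest one.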

Keep in mind that this example is oversimplified because there is no shared cache. If there were one (and assuming it was inclusive), core 1 would simply write its modification into it, and core 2 would get the line much faster. Other systems may also implement direct core-to-core snoops for such cases.
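A minimal sketch of the difference a shared cache makes, under stated assumptions: `Memory`, `InclusiveL2`, and `transfer_on_rwitm` are hypothetical, a line is modeled as 4 words, and the snoop, writeback, and refetch sequence is collapsed into one function. Core 1's dirty line goes to whatever the next level is; with an L2 present, memory sees no write at all, and core 2 ends up with the same merged line either way.

```python
LINE_WORDS = 4  # hypothetical tiny line of 4 words


class Memory:
    def __init__(self):
        self.lines, self.writes = {}, 0

    def write(self, addr, line):
        self.lines[addr] = list(line)
        self.writes += 1  # count writes that actually reach memory

    def read(self, addr):
        return list(self.lines.get(addr, [0] * LINE_WORDS))


class InclusiveL2:
    """Hypothetical shared, inclusive L2 sitting in front of memory."""

    def __init__(self, memory):
        self.memory, self.lines, self.writes = memory, {}, 0

    def write(self, addr, line):
        self.lines[addr] = list(line)  # absorb the dirty line; memory untouched
        self.writes += 1

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory.read(addr)  # fill from memory on miss
        return list(self.lines[addr])


def transfer_on_rwitm(owner_line, addr, next_level, word, value):
    """Core 1 holds `owner_line` in M and snoops core 2's RWITM.

    Core 1 writes the dirty line to the next level (L2 if present, else
    memory) and invalidates; core 2 refetches the line and merges its store.
    """
    next_level.write(addr, owner_line)
    line = next_level.read(addr)
    line[word] = value
    return line  # core 2 now holds this line in M


A = 0x40                       # hypothetical address
core1_line = [0x11, 0, 0, 0]   # core 1's modified copy

mem_only = Memory()
line_a = transfer_on_rwitm(core1_line, A, mem_only, 2, 0x22)

mem_behind_l2 = Memory()
l2 = InclusiveL2(mem_behind_l2)
line_b = transfer_on_rwitm(core1_line, A, l2, 2, 0x22)

print(mem_only.writes, mem_behind_l2.writes)  # 1 0
print(line_a == line_b == [17, 0, 34, 0])     # True
```

Only the destination of core 1's writeback changes; the invalidate-then-refetch ordering is the same in both cases.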

In general, a protocol should assume that no party knows anything beyond what the protocol explicitly tells it. You have to invalidate the line in core 1 to keep coherency (no two cores can modify the same line simultaneously), and since there is no other cache level, you have to write it to memory and guarantee that the data is not lost. Don't count on core 2 for that: as far as core 1 is concerned, core 2 doesn't exist. Core 1 is responding to a mysterious snoop with the only flow it trusts, a writeback to memory.

One last thing: this flow ends (in the slides as well, as far as I could see) with core 2 installing the line in its own cache in the M state, with its modification. From that point on the system can continue in any way (if the line is later snooped again, or ages out of core 2's cache, that's a different matter). The flow doesn't require core 2 to write the line back to memory as you stated, so there is no double write.
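The complete flow can be simulated in a few dozen lines. This is a toy model, not any real implementation: the `Core`, `Bus`, and `Memory` classes are hypothetical, lines are 4 words, the block/writeback/retry sequence is collapsed into a single RWITM call, and for simplicity a write from S also uses RWITM (many designs issue a plain invalidate/upgrade instead). It shows that memory is written exactly once, by core 1's writeback, and that core 2 finishes holding the merged line in M.

```python
from enum import Enum

LINE_WORDS = 4  # hypothetical tiny line of 4 words


class State(Enum):
    M = "Modified"
    E = "Exclusive"
    S = "Shared"
    I = "Invalid"


class Memory:
    def __init__(self):
        self.lines = {}
        self.writes = 0  # counts writes that reach main memory

    def read(self, addr):
        return list(self.lines.get(addr, [0] * LINE_WORDS))

    def write(self, addr, line):
        self.lines[addr] = list(line)
        self.writes += 1


class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.cores = []

    def rwitm(self, requester, addr):
        # Every other core snoops the RWITM. A core holding the line in M
        # writes it back before invalidating (block/writeback/retry is
        # collapsed into this one step in the sketch).
        for core in self.cores:
            if core is not requester:
                core.snoop_rwitm(addr)
        return self.memory.read(addr)


class Core:
    def __init__(self, name, bus):
        self.name = name
        self.bus = bus
        self.cache = {}  # addr -> (State, line)
        bus.cores.append(self)

    def state(self, addr):
        return self.cache.get(addr, (State.I, None))[0]

    def write(self, addr, word, value):
        if self.state(addr) in (State.M, State.E):
            line = self.cache[addr][1]         # already owned: write locally
        else:
            line = self.bus.rwitm(self, addr)  # S or I: gain ownership first (simplified)
        line[word] = value                     # merge the store into the full line
        self.cache[addr] = (State.M, line)

    def snoop_rwitm(self, addr):
        state, line = self.cache.get(addr, (State.I, None))
        if state == State.M:
            self.bus.memory.write(addr, line)  # dirty data must not be lost
        self.cache.pop(addr, None)             # any valid copy becomes I


A = 0x40  # hypothetical address
memory = Memory()
bus = Bus(memory)
core1, core2 = Core("core 1", bus), Core("core 2", bus)

core1.write(A, 0, 0x11)  # miss -> RWITM -> core 1 holds the line in M
core2.write(A, 2, 0x22)  # miss -> RWITM -> core 1 writes back and goes to I

print(core1.state(A), core2.state(A))  # State.I State.M
print(memory.writes)                   # 1 (only core 1's writeback)
print(core2.cache[A][1])               # [17, 0, 34, 0]
```

Memory still holds core 1's version (`[17, 0, 0, 0]`); core 2's modification lives only in its cache until the line is snooped again or evicted.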