
What happens when a core writes to its L1 cache while another core holds the same line in its L1 too?



Let's say, for an Intel Skylake CPU.

How does the cache system preserve consistency? Does it update in real time, or does it stop one of the cores? What's the performance cost of continuously writing to the same cache line with two cores?


Solution

  • In general, modern CPUs use some variant¹ of the MESI protocol to preserve cache coherency.

    In your scenario of an L1 write, the details depend on the existing state of the cache lines: is the line already in the cache of the writing core? In what state does the other core hold the line, e.g., has it been modified?

    Let's take the simple case where the line isn't already in the writing core (C1), and it is in the "exclusive" state in the other core (C2). At the point where the address for the write is known, C1 will issue an RFO (request for ownership) transaction onto the "bus" with the address of the line, and the other cores will snoop the bus and notice the transaction. The other core that has the line will then transition its line from the exclusive to the invalid state, and the value of the line will be provided to the requesting core, which will hold it in the modified state, at which point the write can proceed.

    Note that at this point, further writes to that line from the writing core proceed quickly, since it is in the M state which means no bus transaction needs to take place. That will be the case until the line is evicted or some other core requests access.
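
    To make those transitions concrete, here is a toy C++ sketch of the state machine described above (the enum, the function name and the unconditional invalidation are illustrative simplifications, not how real hardware is organized):

    ```cpp
    #include <cstdio>

    // The four basic MESI states a cache line can be in, per core.
    enum class Mesi { Modified, Exclusive, Shared, Invalid };

    const char* name(Mesi s) {
        switch (s) {
            case Mesi::Modified:  return "M";
            case Mesi::Exclusive: return "E";
            case Mesi::Shared:    return "S";
            case Mesi::Invalid:   return "I";
        }
        return "?";
    }

    // What another core does to its copy when it snoops an RFO for the
    // line: whatever the current state, the copy becomes invalid. (In
    // real hardware, an M-state copy would be forwarded or written back
    // before being invalidated.)
    Mesi on_snooped_rfo(Mesi) { return Mesi::Invalid; }

    int main() {
        Mesi c1 = Mesi::Invalid;    // writing core: line not present yet
        Mesi c2 = Mesi::Exclusive;  // other core holds the line

        // C1's write misses, so it issues an RFO; C2 snoops it and
        // invalidates its copy, and C1 installs the line as Modified.
        // Further writes by C1 hit the M-state line with no bus traffic.
        c2 = on_snooped_rfo(c2);
        c1 = Mesi::Modified;

        std::printf("C1=%s C2=%s\n", name(c1), name(c2));
    }
    ```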

    Now, there are a lot of additional details in actual implementations which aren't covered above, or even in the Wikipedia description of the protocol.

    For example, the basic model involves a single private cache per CPU, and shared main memory. In this model, core C2 would usually provide the value of the shared line onto the bus, even though it has not modified it, since that would be much faster than waiting to read the value from main memory. In all recent x86 implementations, however, there is a shared last-level L3 cache which sits between all the private L2 and L1 caches and main memory. This cache has typically been inclusive so it can provide the value directly to C1 without needing to do a cache-to-cache transfer from C2. Furthermore, having this shared cache means that each CPU may not actually need to snoop the "bus" since the L3 cache can be consulted first to determine which, if any, cores actually have the line. Only the cores that have the line will then be asked to make a state transition. Kind of a push model rather than pull.

    Despite all these implementation details, the basics are the same: each cache line has some "per core" state (even though this state may be stored or duplicated in some central place like the LLC), and this state atomically undergoes logical transitions that ensure that the cache line remains consistent at all times.

    Given that background, here are some specific answers to your final two sub-questions:

    Does it update in real time, or does it stop one of the cores?

    Any modern core is going to do this in real time, and also in parallel for different cache lines. That doesn't mean it is free! For example, in the description above, the write by C1 is stalled until the cache coherence protocol is complete, which is likely dozens of cycles. Contrast that with a normal write which takes only a couple cycles. There are also possible bandwidth issues: the requests and responses used to implement the protocol use shared resources that may have a maximum throughput; if the rate of coherence transactions passes some limit, all requests may slow down even if they are independent.

    In the past, when there was truly a shared bus, there may have been some partial "stop the world" behavior in some cases. For example, the lock prefix for x86 atomic instructions is apparently named based on the lock signal that a CPU would assert on the bus while it was doing an atomic transaction. During that entire period other CPUs are not able to fully use the bus (but presumably they could still proceed with CPU-local instructions).
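
    For reference, here is the kind of C++ that produces a lock-prefixed instruction on x86 today (a minimal sketch; the exact instruction chosen is up to the compiler):

    ```cpp
    #include <atomic>

    std::atomic<long> counter{0};

    void bump() {
        // On x86 this atomic read-modify-write typically compiles to a
        // lock-prefixed instruction such as `lock add` or `lock xadd`.
        // On modern CPUs the lock prefix is normally implemented by
        // locking the cache line (held in M state) rather than the whole
        // bus, provided the operand doesn't straddle a line boundary.
        counter.fetch_add(1, std::memory_order_relaxed);
    }
    ```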

    What's the performance cost of continuously writing to the same cache line with two cores?

    The cost is very high because the line will continuously ping-pong between the two cores as described above (at the end of the process described, just reverse the roles of C1 and C2 and restart). The exact details vary a lot by CPU and even by platform (e.g., a 2-socket configuration will change this behavior a lot), but you are basically looking at a penalty of tens of cycles per write, versus an uncontended throughput of one write per cycle.
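
    If you want to observe the ping-pong yourself, here is a rough C++ sketch (not a rigorous benchmark; the iteration count and the assumed 64-byte line size are arbitrary choices). It runs the same pair of incrementing threads twice: once with both counters in one cache line, once with each counter padded onto its own line:

    ```cpp
    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>

    struct SameLine {                    // both counters likely share one line
        std::atomic<long> a{0};
        std::atomic<long> b{0};
    };

    struct SeparateLines {               // one counter per 64-byte line
        alignas(64) std::atomic<long> a{0};
        alignas(64) std::atomic<long> b{0};
    };

    template <typename Counters>
    double run(Counters& c) {
        constexpr long iters = 50'000'000;
        auto t0 = std::chrono::steady_clock::now();
        std::thread t1([&] { for (long i = 0; i < iters; ++i)
                                 c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t2([&] { for (long i = 0; i < iters; ++i)
                                 c.b.fetch_add(1, std::memory_order_relaxed); });
        t1.join();
        t2.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        SameLine same;
        SeparateLines separate;
        std::printf("same line:      %.2f s\n", run(same));
        std::printf("separate lines: %.2f s\n", run(separate));
    }
    ```

    On typical hardware you would expect the same-line run to be several times slower, which is exactly the ping-pong cost described above.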

    You can find some specific numbers in the answers to this question, which covers both the "two threads on the same physical core" case and the "separate cores" case.

    If you want more details about specific performance scenarios, you should probably ask a separate specific question that lays out the particular behavior you are interested in.


    ¹ The variations on MESI often introduce new states, such as the "owned" state in MOESI or the "forwarded" state in MESIF. The idea is usually to make certain transitions or usage patterns more efficient than the plain MESI protocol, but the basic idea is largely the same.