caching concurrency x86 cpu-architecture compare-and-swap

Is cache coherency only an issue when storing and not when loading?

I came across this code emission for x64 were "Atomic Load" is using a simple movq whereas "Atomic Store" is using xchgq.

This link explains that Atomic Load/Stores on aligned addresses are atomic by default. I'm assuming that's why Atomic Load in the above link is using a simple movq.

I have the following questions;

Is Atomic Store using a xchgq (which enables LOCK by default) to fix any issues with cache lines? essentially it's making sure all cache lines are updated properly? If cache line wasn't an issue they could have just used movq?
Does it also mean cache coherency is only an issue when Storing? As Load above is not using a locked instruction?

Solution

No, seq_cst stores use xchg (or mov + mfence but that's slower on recent CPUs) for ordering wrt. other operations. release or relaxed atomic stores can just use mov and will still be promptly visible to other cores. (Not before later loads in this thread might have executed, though.)

Cache coherence isn't the cause of memory-reordering, that's local to each core. (For x86, the memory model is program order + a store buffer with store-forwarding. It's the store buffer that causes stores to not become visible until after the store instruction has retired from out-of-order exec.)

The answer you linked which says "if I set this to true (or false), no other thread will read a different value after I've set it" (that's not quite such a certainty - you need a "lock" prefix to guarantee that). is somewhat misleading. They mean that (implicit-lock) xchg includes a full memory barrier, so no code in the storing thread can access memory until after the store is actually committed to cache, globally visible.

A clearer way to state that is that it makes this thread wait without doing anything until the store is visible. i.e. stall this thread until the store buffer has finished committing all previous stores. That would eventually happen on its own. So it's really about ordering of this thread relative to store visibility, not other threads. Other threads (cores) can locally do their own early loading / late storing, although on x86 all loads happen in program order. That's why I commented on that answer you linked to disagree with the way it was presenting things.

Can a speculatively executed CPU branch contain opcodes that access RAM? (What a store buffer does)
C++ How is release-and-acquire achieved on x86 only using MOV? discusses cache-coherency and limits on local reordering being enough to give release/acquire synchronization.
Why does a std::atomic store with sequential consistency use XCHG?
Acquire-release on x86
Which is a better write barrier on x86: lock+addl or xchgl? - shows in more detail why we need xchg or a separate memory barrier for a seq_cst store.
https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/ (It's talking about Java volatile, which is like C++ std::atomic with memory_order_seq_cst.
How does memory reordering help processors and compilers?
Does atomic read guarantees reading of the latest value? - people often get hung up on "latest value" guarantees. Don't. Acquire/release just works, and stronger orders or memory barriers don't make stores visible to other cores sooner in any significant way.