Tags: caching, operating-system, x86-64, cpu-cache, dma

Cache coherency (in the particular case of a physically tagged cache)


Imagine a process that has finished (it is no longer in memory) but that, while it was running, used the physical address 0x12345000 (with 4KB pages). Now the OS assigns physical page 0x12345000 to another process that has just started. But the (physically tagged) cache may still hold lines with tag 0x12345 containing the previous process's data. This looks like a coherency problem. How is it solved?

EDIT: The assumption is: one process finishes, and another process is loaded from disk into that same page of physical memory to run. My question is: what is done to prevent problems here? I understood that, before the second process was brought into memory, the page was zeroed. So now the caches hold zeros corresponding to that page, but the page itself holds the second process's data. This is all I have understood, but it is probably wrong.

Peter Cordes's answer is perfect!


Solution

  • But the data remaining in cache is from the previous process

    Yes, that's what's supposed to happen. The cache just keeps track of what's in physical memory. That is its only job. It doesn't know about processes.

    If the OS doesn't want the new process to see that data, the kernel needs to run some instructions to store new data to that page, overwriting cache and memory contents.

    Cache is transparent to this operation; it doesn't matter whether data is still hot in cache, or whether the old process's data has been written back to RAM by the time the kernel reuses that physical page.

    (See also comments under the question for some more details).

    I understand that the OS zeros a physical page, but that happens in main memory; I'm talking about the residual data in cache memory.

    I think this is the source of your confusion: this zeroing takes place with ordinary store instructions executed by the CPU. The OS runs on the CPU, and will zero a page by looping over the bytes (or words) storing zeros. Those stores are normal cacheable stores, the same as any other write coming in at the top of the cache/memory hierarchy.
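    A minimal sketch of that idea in C, run in user space with a heap buffer standing in for a physical page (`PAGE_SIZE` and `zero_page` are illustrative names, not a real kernel's API): the stores in the loop go through the cache like any other write, so stale lines for the page are simply overwritten rather than specially invalidated.

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096

    static void zero_page(void *page) {
        /* Ordinary word-sized stores: they enter at the top of the
         * cache hierarchy, overwriting any residual cached data. */
        uint64_t *p = page;
        for (size_t i = 0; i < PAGE_SIZE / sizeof(uint64_t); i++)
            p[i] = 0;
    }

    int main(void) {
        unsigned char *page = malloc(PAGE_SIZE);
        memset(page, 0xAA, PAGE_SIZE);  /* "residual" data from the old user */
        zero_page(page);
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (page[i] != 0) { puts("FAIL"); return 1; }
        puts("page zeroed");
        free(page);
        return 0;
    }
    ```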

    If the OS wanted to offload the zeroing to a DMA engine or blitter chip that wasn't cache-coherent, then yes the OS would have to invalidate any cache lines in that page first to avoid the problem you're talking about, losing coherence with RAM. But that's not the normal case.


    And BTW, "normal store" can still be pretty fast. e.g. modern x86 CPUs can store 32 or 64 bytes per clock cycle with SIMD instructions, or with rep stosb which is basically a microcoded memset that can internally use wide stores. AMD even has a clzero instruction to zero a full cache line. But these are all still CPU instructions whose view of memory goes through cache.
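    As a rough illustration of `rep stosb` as a microcoded memset, here is a hedged sketch using GNU-style inline asm on x86-64 (with a plain `memset` fallback elsewhere; compilers typically turn that into SIMD stores). Either way, the writes are still ordinary cacheable stores.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static void zero_buf(void *dst, size_t n) {
    #if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
        /* rep stosb: the CPU may use wide stores internally, but its
         * view of memory still goes through the cache. */
        __asm__ volatile("rep stosb"
                         : "+D"(dst), "+c"(n)
                         : "a"(0)
                         : "memory");
    #else
        memset(dst, 0, n);  /* portable fallback; often vectorized */
    #endif
    }

    int main(void) {
        enum { N = 4096 };
        unsigned char *buf = malloc(N);
        memset(buf, 0xFF, N);
        zero_buf(buf, N);
        for (size_t i = 0; i < N; i++)
            if (buf[i]) { puts("FAIL"); return 1; }
        puts("OK");
        free(buf);
        return 0;
    }
    ```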


    Loading new code/data for a new process

    Modern x86-64 systems have cache-coherent DMA, making this a non-problem. This is easy in modern x86-64 where the memory controllers are built into the CPU, so PCIe traffic can check L3 cache on the way past. It doesn't matter what cache lines were still hot in cache from a previous process; DMA into that page evicts those lines from cache. (Or with non-DMA "programmed IO", the data is actually loaded into registers by driver code running on a CPU core, and stored into memory with normal stores, which again are cache-coherent.)

    https://en.wikipedia.org/wiki/Direct_memory_access#Cache_coherency
    Some Xeon systems can even DMA into L3 cache, avoiding main-memory latency/bandwidth bottlenecks (e.g. for multi-gigabit networking) and saving power. https://en.wikipedia.org/wiki/Direct_memory_access#DDIO

    Older systems without cache-coherent DMA do have to be careful to avoid stale cache hits when data in DRAM changes. This is a real problem, and it's not limited to starting a new process. Reusing a just-freed (munmapped) page for a new mmap of a different file has to worry about it. Any disk I/O has to worry about this, including writing to disk: you need to get data from cache synced to DRAM, where it can be DMAed to disk.

    This might require looping over a page and running an instruction like clflush, or the equivalent on other ISAs. (I don't know what OSes did on x86 CPUs that predate clflush, if there were ever any that weren't cache-coherent.) You might find something about it in the Linux kernel's doc directory.
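    The clflush loop might look something like this sketch (user-space stand-in, not real driver code; the 64-byte line size is assumed here, where real code would query CPUID, and the fencing is simplified). `_mm_clflush` writes back and invalidates one cache line; on non-x86 builds the flush is simply skipped.

    ```c
    #include <stdio.h>
    #include <stdlib.h>
    #if defined(__x86_64__)
    #include <immintrin.h>
    #endif

    #define PAGE_SIZE 4096
    #define LINE_SIZE 64  /* assumed; real code queries CPUID */

    /* Flush every line of a page before handing it to a
     * hypothetical non-coherent DMA engine. */
    static void flush_page(void *page) {
    #if defined(__x86_64__)
        char *p = page;
        for (size_t off = 0; off < PAGE_SIZE; off += LINE_SIZE)
            _mm_clflush(p + off);
        _mm_mfence();  /* ensure flushes complete before DMA would start */
    #else
        (void)page;  /* other ISAs use their own ops, e.g. ARM's DC CIVAC */
    #endif
    }

    int main(void) {
        char *page = aligned_alloc(PAGE_SIZE, PAGE_SIZE);
        page[0] = 42;
        flush_page(page);
        /* Flushing wrote the data back to DRAM; the CPU still reads 42. */
        printf("%d\n", page[0]);
        free(page);
        return 0;
    }
    ```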

    This LWN article: DMA, small buffers, and cache incoherence from 2002 might be relevant. At that point, x86 was already said to have cache-coherent DMA, so maybe x86 has always had this. Before SSE, I don't know how x86 could reliably invalidate cache except for wbinvd, which is extremely slow and system-wide (invalidating all cache lines, not just one page), not really usable for performance reasons.


    Either way (coherent or not), an OS wouldn't waste time storing zeros to pages it was about to read from disk. Zeroing is done for a new process's BSS, and any pages it allocates with mmap(MAP_ANONYMOUS), not for its code/data sections.
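    The MAP_ANONYMOUS case can be observed directly from user space. A short POSIX sketch: the kernel hands out anonymous pages already zeroed (with normal stores, so cache and RAM agree), whereas file-backed mappings just get the file's contents.

    ```c
    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 4096;
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) return 1;
        int all_zero = 1;
        for (size_t i = 0; i < len; i++)
            if (p[i]) all_zero = 0;  /* kernel must not leak old data */
        printf("anonymous page all zero: %s\n", all_zero ? "yes" : "no");
        munmap(p, len);
        return 0;
    }
    ```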

    Also, the executable you're executing as a new process could already be in RAM, in which case you just have to set up the new process's page tables.