
OS cache/memory hierarchy: How does writing to a new file work?


I know how read/load operations are theoretically supposed to work in OSes. A read instruction causes a TLB lookup, then a look through caches, then a look in main memory, and finally a read from disk if not satisfied at a previous level.

How does a write operation work for a new file? Clearly when writing to an existing file, one might read the file first and then write to the corresponding cache lines. But a new file would not have any cache lines to write to.

Would a CPU be able to "create" a new cache line that is not yet memory-backed with the write? Or does the CPU have to tell RAM to create some empty memory, then load the empty memory into cache lines, just so it can then write to those empty cache lines? This would imply that all write operations require a load operation beforehand.


Solution

  • When writing a new file, the kernel has to allocate a page of physical memory to hold the file data. Kernel pages holding file contents, including both clean and dirty (not yet written to disk) data, are called the "pagecache". This is unrelated to CPU caches.

    Physical memory always exists; it doesn't come into being when allocated. Allocation is just a software mechanism for deciding which stores/loads will go where. CPU caches are tagged by physical address. (Some old CPUs, at least some non-x86 ones, used virtually-indexed, virtually-tagged L1 caches, so software memory allocation had to invalidate those caches when page-table mappings changed. Modern Intel uop caches are virtually addressed like that, with invalidation done in hardware.)

    This would imply that all write operations require a load operation beforehand.

    Yes, a store can't commit from the store buffer to cache until this core has MESI Exclusive ownership of the cache line. Normally that involves a Read For Ownership (RFO) so the core can update the line's contents. When storing a whole cache line at once, it's possible for a CPU core to just invalidate copies in other caches without spending DRAM bandwidth on reading the old value, e.g. with x86 NT stores (non-temporal, e.g. _mm_stream_ps) or rep stos or rep movs. See Enhanced REP MOVSB for memcpy for more about no-RFO stores.
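    The full-line no-RFO path can be exercised from user space with SSE2 non-temporal stores. A minimal sketch (buffer name and fill pattern are arbitrary): four 16-byte _mm_stream_si128 stores cover one 64-byte line, so the CPU can write-combine them and invalidate other copies instead of reading the old contents first.

    ```c
    #include <emmintrin.h>  /* SSE2 intrinsics: _mm_stream_si128, _mm_sfence */
    #include <stdio.h>

    int main(void) {
        /* One 64-byte cache line, aligned so the NT stores stay within it. */
        _Alignas(64) char line[64];
        __m128i pattern = _mm_set1_epi8('A');

        /* Four 16-byte non-temporal stores cover the whole line, so the CPU
           can use a write-combining, no-RFO protocol: invalidate other
           copies instead of reading the old line contents first. */
        for (int i = 0; i < 64; i += 16)
            _mm_stream_si128((__m128i *)(line + i), pattern);
        _mm_sfence();  /* order/flush the write-combining stores */

        printf("%c%c\n", line[0], line[63]);  /* prints "AA" */
        return 0;
    }
    ```

    Whether the hardware actually avoids the RFO here is a microarchitectural detail; the intrinsics only make it possible.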


    example: echo "hello" > new_file.txt

    On a POSIX system like Linux, your shell will make two system calls:

    • fd = open("new_file.txt", O_CREAT|O_TRUNC|O_WRONLY, 0666). This creates the inode if it didn't exist already (and if not tmpfs, the filesystem queues up some metadata I/O to go to disk at some point, usually after a timeout in case more metadata ops happen soon after). OSes also cache VFS (virtual filesystem) structure, not just file data. Anyway, the file size is still zero so no pagecache page for the data even has to get allocated yet. If there was data before, O_TRUNC discards it, freeing any pages that were hot in the pagecache for it.

    • write(fd, "hello\n", 6); - Assuming Linux again, the sys_write kernel function (called with the args passed by user-space) will notice that it needs to allocate a physical page to be the pagecache page that holds this file's data. (And maybe zero it, or at least the parts that it's not about to copy to, since I/O to block devices typically works in 512B or 4K sectors, and it's better not to copy stale kernel data onto disk, especially if it might be an unprivileged user's USB stick.)

      Then it'll call copy_from_user to copy from the user-space buffer (containing "hello\n") to that page. We're in the kernel so the page doesn't need to get mapped to a user-space virtual address, and Linux's memory map (e.g. for x86-64) keeps all of physical memory direct-mapped to a range of kernel virtual address-space, so virt_address = phys_address + page_offset_base. (Except on systems like 32-bit x86 that might have more physical memory than kernel virtual address-space, thus highmem shenanigans... Linus Torvalds had a good rant about how much PAE sucks for kernel software, explaining along the way some OS principles. Not all kernels want to keep all memory mapped all the time, but Linux does because it's simple and efficient.)

      copy_from_user will check the user address for validity (so e.g. user-space can't pass a kernel address to get the kernel to copy arbitrary data into a file!), and that it's currently present in the hardware page table so reading it won't cause a #PF page-fault exception.

      The actual copying will be with rep movsb as a memcpy (unless copy_from_user special-cases small copies). For this small 6-byte copy, it probably doesn't try to do any no-RFO special stuff, so it's essentially equivalent to mov rax, [rsi] / mov [rdi], rax if it had been an 8-byte copy.
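    For reference, the shell's work for that redirection can be reproduced with the same two system calls from C (the file name is just the one from the example above):

    ```c
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>

    int main(void) {
        /* Same syscalls the shell makes for: echo "hello" > new_file.txt */
        int fd = open("new_file.txt", O_CREAT | O_TRUNC | O_WRONLY, 0666);
        if (fd < 0) { perror("open"); return 1; }

        /* sys_write allocates (and zeros) a pagecache page on first write,
           then copy_from_user copies these 6 bytes into it. */
        ssize_t n = write(fd, "hello\n", 6);
        close(fd);

        printf("%zd\n", n);  /* prints 6 on success */
        return 0;
    }
    ```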

    This store by the kernel into a newly-allocated page isn't "special", it's just like what happens when your user-space code stores to memory that it hasn't touched for a while (but which doesn't trigger a page fault).

    Assuming this physical page wasn't recently written by this core, the store will miss in at least L1d and L2 cache, and this core will send out an RFO to get a copy of the line along with ownership. After the reply arrives, the store-buffer entry can commit to L1d cache. The RFO is done by hardware; the code that executes is just rep movsb (or an equivalent mov [rdi], rax).

    Actually, if the kernel did just zero this page as part of allocating it, rep stosb would have used a no-RFO store protocol to invalidate copies of these lines in any other cores' caches. But (unlike NT stores) the data will then be hot in this core's cache, already in MESI Modified state, so the store by copy_from_user can commit without further off-core communication.

    (AMD has a clzero instruction that zeros a whole cache line as an NT store, avoiding that cache pollution; but pages are often used right after being zeroed, as in this case, so leaving them hot in cache is usually what you want.)
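    A user-space analogue of that zero-then-store sequence, as a sketch only: an anonymous mmap hands back zero-filled pages (much as the kernel zeroes a fresh pagecache page), and the first store into such a page goes through the same miss/ownership machinery described above.

    ```c
    #include <string.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        /* Anonymous mmap returns zero-filled pages. */
        size_t pagesz = 4096;
        char *page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) return 1;

        /* First store into the page, analogous to copy_from_user copying
           "hello\n" into the newly allocated pagecache page. */
        memcpy(page, "hello\n", 6);

        char first = page[0];   /* 'h' */
        char later = page[100]; /* untouched bytes are still zero */
        printf("%c %d\n", first, later);  /* prints "h 0" */

        munmap(page, pagesz);
        return 0;
    }
    ```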