Tags: linux-kernel, fork, virtual-memory, copy-on-write, page-tables

How does fork() mark the parent's PTEs as read-only?


I've searched through a lot of resources, but found nothing concrete on the matter:

I know that on some Linux systems, the fork() syscall uses copy-on-write (COW): the parent and the child share the same physical pages, but the PTEs are marked read-only. When either process tries to write to a page, a page fault occurs and the page is copied to another location, where it can be modified.

However, I cannot understand how the OS reaches the shared PTEs to mark them as read-only. I have hypothesized that when a fork() syscall occurs, the OS performs a "page walk" over the parent's page table and marks the entries as read-only, but I can find no confirmation of this, or any information about the process.

Does anyone know how the pages come to be marked as read only? Will appreciate any help. Thanks!


Solution

  • Linux implements the fork syscall by iterating over all memory ranges (mmaps, stack, and heap) of the parent process. Each range (a VMA, virtual memory area) is copied by the function copy_page_range (mm/memory.c), which walks the page tables down to the individual page table entries:

        /*
         * If it's a COW mapping, write protect it both
         * in the parent and the child
         */
        if (is_cow_mapping(vm_flags)) {
            ptep_set_wrprotect(src_mm, addr, src_pte);
            pte = pte_wrprotect(pte);
        }
    

    where is_cow_mapping is true for private, potentially writable mappings (the flags bitfield is checked for the shared and maywrite bits, and only the maywrite bit may be set):

    #define VM_SHARED   0x00000008
    #define VM_MAYWRITE 0x00000020
    
    static inline bool is_cow_mapping(vm_flags_t flags)
    {
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }
    

    PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".

    How fork implementation calls copy_page_range:

    • The fork syscall implementation (sys_fork, defined via SYSCALL_DEFINE0(fork)) ends up in do_fork (kernel/fork.c), which calls
    • copy_process which will call many copy_* functions, including
    • copy_mm which calls
    • dup_mm to allocate and fill new mm struct, where most work is done by
    • dup_mmap (still kernel/fork.c), which checks what was mapped and how. (I could not find the exact path to the COW implementation here, so I ran a web search for something like "fork+COW+dup_mm" and got hints such as [1], [2], and [3].) After checking the mapping types, the line retval = copy_page_range(mm, oldmm, mpnt); does the real work.