I've searched through a lot of resources, but found nothing concrete on the matter:
I know that on some Linux systems the fork() syscall works with copy-on-write; that is, the parent and the child share the same physical pages, but the PTEs are now marked read-only so that COW can be applied later. When either process tries to write to a page, a page fault occurs and the page is copied to another place, where it can be modified.
However, I cannot understand how the OS reaches the shared PTEs to mark them as read-only. I have hypothesized that when a fork() syscall occurs, the OS performs a "page walk" on the parent's page table and marks the entries read-only, but I find no confirmation for this, or any information regarding the process.
Does anyone know how the pages come to be marked as read only? Will appreciate any help. Thanks!
Linux implements the fork syscall by iterating over all memory ranges (mmaps, stack and heap) of the parent process. These ranges (VMAs, virtual memory areas) are copied by the function copy_page_range (mm/memory.c), which loops over the page table entries:
copy_page_range will iterate over pgd and call
copy_pud_range to iterate over pud and call
copy_pmd_range to iterate over pmd and call
copy_pte_range to iterate over pte and call
copy_one_pte, which does memory usage accounting (RSS) and has several code segments to handle the COW case:

    /*
     * If it's a COW mapping, write protect it both
     * in the parent and the child
     */
    if (is_cow_mapping(vm_flags)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }
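To make the nested walk easier to picture, here is a toy user-space model of it. It is only a sketch under heavy assumptions: it uses two levels instead of pgd/pud/pmd/pte, and every name in it (toy_pgd, toy_pte_table, TOY_WRITABLE, toy_copy_one_pte, toy_copy_page_range) is hypothetical and not a kernel type or function. Unlike the kernel, it also expects the child's tables to be allocated by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names and layout, not kernel definitions. */
#define TOY_ENTRIES  4
#define TOY_PRESENT  0x1u
#define TOY_WRITABLE 0x2u

typedef uint32_t toy_pte_t;

struct toy_pte_table { toy_pte_t pte[TOY_ENTRIES]; };
struct toy_pgd       { struct toy_pte_table *table[TOY_ENTRIES]; };

/* Innermost step: copy one entry, write-protecting both sides if COW. */
static void toy_copy_one_pte(toy_pte_t *dst, toy_pte_t *src, bool cow)
{
    toy_pte_t pte = *src;

    if (cow) {
        *src &= ~TOY_WRITABLE;  /* parent entry becomes read-only */
        pte  &= ~TOY_WRITABLE;  /* child entry starts read-only too */
    }
    *dst = pte;                 /* both entries refer to the same page */
}

/*
 * Outer walk: visit every table, then every entry in it.
 * Assumes the caller already allocated dst tables wherever src has one.
 * Returns the number of present entries that were copied.
 */
static size_t toy_copy_page_range(struct toy_pgd *dst, struct toy_pgd *src,
                                  bool cow)
{
    size_t copied = 0;

    for (size_t i = 0; i < TOY_ENTRIES; i++) {
        if (src->table[i] == NULL || dst->table[i] == NULL)
            continue;           /* nothing mapped at this level */
        for (size_t j = 0; j < TOY_ENTRIES; j++) {
            toy_pte_t *s = &src->table[i]->pte[j];

            if (!(*s & TOY_PRESENT))
                continue;       /* skip holes */
            toy_copy_one_pte(&dst->table[i]->pte[j], s, cow);
            copied++;
        }
    }
    return copied;
}
```

The point of the model is that the parent's entry is modified during the copy, which is what lets a later write from either side trigger a fault.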
where is_cow_mapping will be true for private and potentially writable mappings (the flags bitfield is checked for the shared and maywrite bits, and only the maywrite bit should be set):
    #define VM_SHARED 0x00000008
    #define VM_MAYWRITE 0x00000020

    static inline bool is_cow_mapping(vm_flags_t flags)
    {
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }
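The check can be tried outside the kernel. The sketch below copies the two flag values and the test from the snippet above; the unsigned long typedef for vm_flags_t is a stand-in I am assuming for user space, and toy_flags is a hypothetical helper added only to make the four cases readable.

```c
#include <stdbool.h>

/* Flag values as quoted above; the typedef is a user-space stand-in. */
#define VM_SHARED   0x00000008UL
#define VM_MAYWRITE 0x00000020UL

typedef unsigned long vm_flags_t;

static inline bool is_cow_mapping(vm_flags_t flags)
{
    /* COW only when the mapping may be written AND is not shared. */
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}

/* Hypothetical helper: build a flags word from the two properties. */
static inline vm_flags_t toy_flags(bool shared, bool maywrite)
{
    return (shared ? VM_SHARED : 0UL) | (maywrite ? VM_MAYWRITE : 0UL);
}

/*
 * The four combinations:
 *   private, may write    -> COW
 *   private, never write  -> not COW (nothing can diverge)
 *   shared,  may write    -> not COW (writes must stay visible to both)
 *   shared,  never write  -> not COW
 */
```

Only the private, potentially writable case needs write protection, because it is the only one where parent and child are allowed to end up with different contents.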
PUD, PMD, and PTE are described in books like https://www.kernel.org/doc/gorman/html/understand/understand006.html and in articles like LWN 2005: "Four-level page tables merged".
How the fork implementation calls copy_page_range:
do_fork (kernel/fork.c) will call copy_process,
which will call many copy_* functions, including copy_mm,
which calls dup_mm to allocate and fill a new mm struct, where most of the work is done by dup_mmap (still kernel/fork.c), which checks what was mmapped and how. (Here I was unable to get the exact path to the COW implementation, so I used the Internet Search Machine with something like "fork+COW+dup_mm" to get hints like [1] or [2] or [3].) After checking the mmap types, there is the line

    retval = copy_page_range(mm, oldmm, mpnt);

to do the real work.
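From user space, the only thing you can observe is the result: after fork(), a write in the child does not show up in the parent. The sketch below checks just that. Note that it cannot tell copy-on-write apart from an eager copy, since both give the same visible behavior; the names cow_demo and demo_value are mine, and a POSIX system is assumed.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* volatile so the compiler really performs the store and the loads */
static volatile int demo_value = 1;

/*
 * Returns 1 if the child saw its own write while the parent still
 * reads the old value, 0 if not, and -1 if fork/waitpid failed.
 */
static int cow_demo(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: on a COW kernel this write faults and gets a private page. */
        demo_value = 42;
        _exit(demo_value == 42 ? 0 : 1);
    }

    /* Parent: wait for the child, then look at our own copy. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return 0;

    return demo_value == 1;
}
```

Calling cow_demo() from main and printing the result is enough to try it; the interesting part happens inside the kernel, along the copy_page_range path described above.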