
Two TLB misses per mmap/access/munmap


for (int i = 0; i < 100000; ++i) {
    int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

    page[0] = 0;

    munmap(page, PAGE_SIZE);
}

I expect to get ~100000 dTLB-store-misses in userspace, one per iteration (along with ~100000 page-faults and kernel-side dTLB-load-misses). Running the following command, the result is roughly 2x what I expect. I would appreciate it if someone could clarify why this is the case:

perf stat -e dTLB-store-misses:u ./test
Performance counter stats for './test':

           200,114      dTLB-store-misses

       0.213379649 seconds time elapsed

P.S. I have verified and am certain that the generated code doesn't introduce anything that would explain this result. Also, I do get ~100000 page-faults and ~100000 dTLB-load-misses:k.


Solution

  • I expect to get ~100000 dTLB-store-misses in userspace, one per iteration

    I would expect that:

    • The CPU tries to execute page[0] = 0;, looks up the translation for page[0], can't find a TLB entry for it, increments dTLB-store-misses, walks the page tables, finds the page is "not present", and generates a page fault.
    • The page fault handler allocates a page and (because the page table was modified) ensures that no stale TLB entry remains (possibly by relying on the fact that Intel CPUs don't cache "not present" entries anyway, not necessarily by an explicit INVLPG). The handler returns to the instruction that caused the fault so it can be retried.
    • The CPU tries to do page[0] = 0; a second time, again can't find a TLB entry for it (none was ever cached), increments dTLB-store-misses a second time, fetches the translation, and completes the store.

    That is two dTLB-store-misses per iteration, matching the ~200,000 you observed.

    For fun, you could use the MAP_POPULATE flag with mmap() to try to get the kernel to pre-allocate the pages (and avoid the page fault and the first TLB miss).