#include <sys/mman.h>
#define PAGE_SIZE 4096          /* assuming 4 KiB pages */

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        /* map a fresh anonymous page, store to it once, then unmap it */
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }
}
I expect to get ~100000 dTLB-store-misses in userspace, one per iteration (and likewise ~100000 page-faults and dTLB-load-misses on the kernel side). Running the following command, the result is roughly 2x what I expect. I would appreciate it if someone could clarify why this is the case:
perf stat -e dTLB-store-misses:u ./test
Performance counter stats for './test':
200,114 dTLB-store-misses
0.213379649 seconds time elapsed
P.S. I have inspected the generated code and am certain it doesn't introduce anything that would explain this result. Also, I do get ~100000 page-faults and dTLB-load-misses:k.
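(For reference, the three counters can be collected in a single run; the event names and modifiers below are the generic perf ones used above and may not all be available on every CPU/kernel:)

perf stat -e dTLB-store-misses:u,dTLB-load-misses:k,page-faults ./test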
I expect to get ~100000 dTLB-store-misses in userspace, one per iteration
I would expect that:

- page[0] = 0; tries to load the cache line containing page[0], can't find a TLB entry for the page, increments dTLB-store-misses, fetches the translation, realises the page is "not present", then generates a page fault.
- The kernel's page fault handler allocates and maps the page (possibly issuing INVLPG to flush any stale TLB entry). The page fault handler returns to the instruction that caused the fault so it can be retried.
- page[0] = 0; runs a second time, tries to load the cache line containing page[0], again can't find a TLB entry for the page, increments dTLB-store-misses a second time, fetches the translation, then modifies the cache line.

That is two dTLB-store-misses per iteration, which matches the ~200,000 you measured.

For fun, you could use the MAP_POPULATE flag with mmap() to try to get the kernel to pre-allocate the pages (and avoid the page fault and the first TLB miss).
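A minimal sketch of that variant, assuming the same loop as above and a 4 KiB PAGE_SIZE (the only change is adding MAP_POPULATE to the mmap() flags):

#define _GNU_SOURCE             /* ensure MAP_POPULATE is visible */
#include <sys/mman.h>
#define PAGE_SIZE 4096          /* assuming 4 KiB pages */

int main(void) {
    for (int i = 0; i < 100000; ++i) {
        /* MAP_POPULATE asks the kernel to pre-fault the page at mmap()
           time, so the store below should skip the page fault and the
           first TLB miss, leaving roughly one dTLB-store-miss per
           iteration. */
        int *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_ANONYMOUS | MAP_PRIVATE | MAP_POPULATE, -1, 0);
        page[0] = 0;
        munmap(page, PAGE_SIZE);
    }
}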