I am doing a simple test on monitoring page faults with the code below, What I don't know is how a simple one line of code below doubled my page fault count. if I use
ptr[i+4096] = 'A'
I got 25,722 page-faults with perf tool, which is what I expected, but if I use
tmp = ptr[i+4096]
instead, the page-faults doubled to 51,322 I don't how to explain it. Below is the complete code. Thanks!
void do_something() {
int i;
char* ptr;
char tmp;
ptr = malloc(100*1024*1024);
int j = 0;
int k = 0;
for (i = 0; i < 100*1024*1024; i+=4096) {
//ptr[i+4096] = 'A' ;
tmp = ptr[i+4096];
for (j = 0 ; j < 4096; j++)
ptr[i+j] = (char) (i & 0xff); // pagefault
}
free(ptr);
}
int main(int argc, char* argv[]) {
do_something();
return 0;
}
Machine Info: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 40 On-line CPU(s) list: 0-39 Thread(s) per core: 2 Core(s) per socket: 10 Socket(s): 2 NUMA node(s): 2 Vendor ID: GenuineIntel CPU family: 6 Model: 63 Model name: Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz Stepping: 2 CPU MHz: 3096.188 BogoMIPS: 6197.81 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 25600K NUMA node0 CPU(s): 0-9,20-29 NUMA node1 CPU(s): 10-19,30-39
3.10.0-514.32.3.el7.x86_64 #1
malloc()
will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap
. Such pages are generally allocated lazily: no actual page is allocated until the first access.
What happens then depends on the type of the first access: when you do a read first, Linux will map in a shared read-only COW page of zeros to satisfy it, and then if you later you write it takes a second fault to allocate the private writeable page.
When you just do the write first, the first step is skipped. That's the usual case since code generally isn't reading from newly allocated memory which has undefined contents (at least when you get it from malloc
).
Note that the above is a description of how newly allocated pages work in Linux - when you use malloc
there is another layer: malloc
will generally try to satisfy requests for blocks the process freed earlier, rather than continually requesting new memory. In the case memory is re-used, it will generally already be paged in and the above won't apply. Of course for your initial big allocation of 1024 MiB, where is no memory to re-use so you can be sure the allocator is getting it from the OS.