linux virtual-memory systemtap

Repeated Minor Pagefaults at Same Address After Calling mlockall()


The Problem

While trying to reduce or eliminate minor pagefaults in an application, I ran into a confusing phenomenon: writes to the same address repeatedly trigger minor pagefaults, even though I thought I had taken sufficient steps to prevent them.

Background

As per the advice here, I called mlockall to lock all current and future pages into memory.

In my original use-case (which involved a rather large array) I also pre-faulted the data by writing to every element (or at least to every page) as per the advice here; though I realize the advice there is intended for users running a kernel with the RT patch, the general idea of forcing writes to thwart COW / demand paging should remain applicable.
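A minimal sketch of the pre-faulting step described above, assuming a plain byte buffer. The name prefault_buffer is hypothetical, and the 4096-byte fallback page size is an assumption used only if sysconf fails. Each page is read and written back through a volatile pointer, so the existing contents are preserved while the write still forces a private, writable mapping.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Touch one byte in every page overlapped by [buf, buf + len).
 * Returns the number of pages touched. */
size_t prefault_buffer(void *buf, size_t len) {
  if (buf == NULL || len == 0) return 0;

  long reported = sysconf(_SC_PAGESIZE);
  /* Assumption: fall back to 4096 if the page size cannot be queried. */
  uintptr_t page = reported > 0 ? (uintptr_t)reported : 4096;

  uintptr_t first = (uintptr_t)buf;
  uintptr_t end = first + len;
  size_t touched = 0;

  /* Start at the page boundary at or below buf so that no page is skipped
   * when buf is not page-aligned. */
  for (uintptr_t p = first & ~(page - 1); p < end; p += page) {
    volatile unsigned char *byte =
        (volatile unsigned char *)(p < first ? first : p);
    *byte = *byte; /* read and write back: contents stay the same */
    touched++;
  }
  return touched;
}
```

Calling this once on each large buffer after mlockall(), and before the time-critical code starts, is the intended usage.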

I had thought that mlockall could be used to prevent minor page faults. While the man page only seems to guarantee that there will be no major faults, various other resources (e.g. above) state that it can be used to prevent minor page faults as well.

The kernel documentation seems to indicate this as well. For example, unevictable-lru.txt and pagemap.txt state that mlock()'ed pages are unevictable and therefore not suitable for reclamation.

In spite of this, I continued to trigger several minor pagefaults.

Example

I've created an extremely stripped down example to illustrate the problem:

#include <sys/mman.h> // mlockall
#include <stdlib.h> // abort

int main(void) {
  int x;

  if (mlockall(MCL_CURRENT | MCL_FUTURE)) abort();

  while (1) { // plain C: "true" would need <stdbool.h>
    asm volatile("" ::: "memory"); // So GCC won't optimize out the write
    x = 0x42;
  }
  return 0;
}

Here I repeatedly write to the same address. It is easy to see (e.g. via cat /proc/[pid]/stat | awk '{print $10}', where field 10 is the minor-fault count) that I continue to have minor pagefaults long after the initialization is complete.
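The same counter can also be sampled from inside the process with getrusage, which avoids parsing /proc. This is a small sketch rather than part of the original test program, and the function name minor_faults_so_far is hypothetical.

```c
#include <sys/resource.h>

/* Returns the cumulative number of minor (soft) page faults taken by the
 * calling process, or -1 if getrusage fails. */
long minor_faults_so_far(void) {
  struct rusage usage;
  if (getrusage(RUSAGE_SELF, &usage) != 0) return -1;
  return usage.ru_minflt;
}
```

Sampling the value before and after a stretch of the loop and subtracting gives the number of minor faults taken in between.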

Running a modified version* of the pfaults.stp script included in systemtap-doc, I logged the time of each pagefault, address that triggered the fault, address of the instruction that triggered the fault, whether it was major/minor, and read/write. After the initial faults from startup and mlockall, all faults were identical: The attempt to write to x triggered a minor write fault.

The interval between successive pagefaults displays a striking pattern. For one particular run, the intervals were, in seconds: 2, 4, 4, 4.8, 8.16, 13.87, 23.588, 40.104, 60, 60, 60, 60, 60, 60, 60, 60, 60, ... This appears to be (approximately) exponential back-off, with an absolute ceiling of 1 minute.

Running it on an isolated CPU has no impact; neither does running with a higher priority. However, running with a realtime priority eliminates the pagefaults.
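For reference, one way to get a realtime priority from inside the program is sched_setscheduler with SCHED_FIFO; the same effect can be had from the shell with chrt. This is only a sketch: the helper name use_realtime_fifo is hypothetical, and the call normally fails with EPERM unless the process has the necessary privileges.

```c
#include <sched.h>
#include <string.h>

/* Switch the calling thread to the SCHED_FIFO realtime policy at the given
 * priority. Returns 0 on success, or -1 with errno set (for example EPERM
 * without sufficient privileges, EINVAL for an out-of-range priority). */
int use_realtime_fifo(int priority) {
  struct sched_param param;
  memset(&param, 0, sizeof param);
  param.sched_priority = priority;
  return sched_setscheduler(0, SCHED_FIFO, &param); /* pid 0 = caller */
}
```

A valid priority can be obtained with sched_get_priority_min(SCHED_FIFO) and sched_get_priority_max(SCHED_FIFO).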

The Questions

  1. Is this behavior expected?
    1a. What explains the timing?
  2. Is it possible to prevent this?

Versions

I'm running Ubuntu 14.04, with kernel 3.13.0-24-generic and Systemtap version 2.3/0.156, Debian version 2.3-1ubuntu1 (trusty). Code compiled with gcc-4.8 with no extra flags, though optimization level doesn't seem to matter (provided the asm volatile directive is left in place; otherwise the write gets optimized out entirely).

I'm happy to include further details (e.g. exact stap script, original output, etc.) if they will prove relevant.


*Actually, the vm.pagefault probe was broken for my combination of kernel and systemtap because it referenced a variable that no longer existed in the kernel's handle_mm_fault function, but the fix was trivial.


Solution

  • @fche's mention of Transparent Huge Pages put me onto the right track.

    A more careful read of the kernel documentation I linked to in the question shows that mlock() does not prevent the kernel from migrating the page to a new page frame; indeed, there's an entire section devoted to migrating mlocked pages. Thus, simply calling mlock() does not guarantee that you will not experience any minor pagefaults.

    Somewhat belatedly, I see that this answer quotes the same passage and partially answers my question.

    One of the reasons the kernel might move pages around is memory compaction, whereby the kernel frees up a large contiguous block of pages so a "huge page" can be allocated. Transparent huge pages can be easily disabled; see e.g. this answer.
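    Before changing anything, it can help to check which transparent huge page mode is active. The sysfs file /sys/kernel/mm/transparent_hugepage/enabled lists the available modes and marks the active one with square brackets, e.g. "always [madvise] never". The sketch below reads and parses that file; the function names active_choice and thp_mode are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) word out of a line such as
 * "always [madvise] never" into out. Returns 0 on success, -1 if there is
 * no bracketed word or it does not fit in out_len bytes. */
int active_choice(const char *line, char *out, size_t out_len) {
  const char *open = strchr(line, '[');
  const char *close = open ? strchr(open, ']') : NULL;
  if (open == NULL || close == NULL) return -1;

  size_t n = (size_t)(close - open - 1);
  if (n + 1 > out_len) return -1; /* need room for the terminating NUL */
  memcpy(out, open + 1, n);
  out[n] = '\0';
  return 0;
}

/* Read the system-wide THP mode into out. Returns 0 on success, -1 if the
 * file is missing or unreadable (e.g. THP not built into the kernel). */
int thp_mode(char *out, size_t out_len) {
  char line[128];
  FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
  if (f == NULL) return -1;
  char *got = fgets(line, sizeof line, f);
  fclose(f);
  return got ? active_choice(line, out, out_len) : -1;
}
```

    Writing "never" to the same file as root disables THP system-wide.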

    My particular test case was the result of some NUMA balancing changes introduced in the 3.13 kernel.

    Quoting the LWN article linked therein:

    The scheduler will periodically scan through each process's address space, revoking all access permissions to the pages that are currently resident in RAM. The next time the affected process tries to access that memory, a page fault will result. The scheduler will trap that fault and restore access to the page in question...

    This behavior of the scheduler can be disabled by setting the NUMA policy of the process to explicitly use a certain node. This can be done using numactl at the command line (e.g. numactl --membind=0) or a call to the libnuma library.
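    As a rough in-process equivalent of numactl --membind, the sketch below binds the calling thread's memory policy to a single node. To avoid linking against libnuma, it invokes the set_mempolicy system call directly rather than using the libnuma wrappers. The helper name bind_memory_to_node and the constant SKETCH_MPOL_BIND are hypothetical; the value 2 is assumed to match MPOL_BIND in <linux/mempolicy.h>.

```c
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: 2 is the value of MPOL_BIND in <linux/mempolicy.h>. */
#define SKETCH_MPOL_BIND 2

/* Restrict the calling thread's future memory allocations to one NUMA node.
 * Returns 0 on success, -1 on failure (errno is set by the kernel, except
 * when the node number does not fit in the single-word mask used here). */
int bind_memory_to_node(unsigned node) {
  unsigned long mask = 0;
  /* Keep one bit of headroom: this sketch only supports small node IDs. */
  if (node >= sizeof mask * 8 - 1) return -1;
  mask = 1UL << node;
  /* Last argument: size of the node mask, in bits. */
  return (int)syscall(SYS_set_mempolicy, SKETCH_MPOL_BIND, &mask,
                      sizeof mask * 8);
}
```

    Calling bind_memory_to_node(0) early in main corresponds to the numactl --membind=0 example above; the call may be refused in restricted environments such as some containers.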

    EDIT: The sysctl documentation explicitly states regarding NUMA balancing:

    If the target workload is already bound to NUMA nodes then this feature should be disabled.

    This can be done (as root) with sysctl -w kernel.numa_balancing=0.
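    The current setting is exposed as /proc/sys/kernel/numa_balancing, so a program can check at startup whether automatic NUMA balancing is still on. This is a sketch with hypothetical helper names (read_int_file, numa_balancing_enabled); the file does not exist on kernels built without NUMA balancing support, which is reported here as -1.

```c
#include <stdio.h>

/* Read one integer from a procfs-style text file.
 * Returns 0 and stores the value in *value, or -1 on any failure. */
int read_int_file(const char *path, int *value) {
  FILE *f = fopen(path, "r");
  if (f == NULL) return -1;
  int ok = fscanf(f, "%d", value) == 1;
  fclose(f);
  return ok ? 0 : -1;
}

/* Returns 1 if automatic NUMA balancing is enabled, 0 if it is disabled,
 * and -1 if the setting cannot be read. */
int numa_balancing_enabled(void) {
  int v;
  if (read_int_file("/proc/sys/kernel/numa_balancing", &v) != 0) return -1;
  return v != 0;
}
```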

    There may still be other causes for page migration, but this sufficed for my purposes.