linux mmap memory-mapped-files xfs

What updates mtime after writing to memory mapped files?


I'm using XFS on Linux and have a memory mapped file to which I write once per second. I notice that the file mtime (shown by watch ls --full-time) changes periodically but irregularly. The gap between mtimes seems to be between 2 and 20 seconds but it is not consistent. There is very little else running on the system--in particular there's only one program of mine writing the file, plus one reading.

The same program writes much more frequently to some other mmapped files, and their mtime changes exactly once per 30 seconds.

I am not using msync() (which would update mtime when called).

My questions:

  1. What updates mtime?
  2. Is the update interval configurable?
  3. Why do some mtimes get updated exactly once per 30 seconds but some files which I write less frequently have fresher (irregular but always less than 30 seconds old) mtimes?

Solution

  • When you mmap a file, you're basically sharing memory directly between your process and the kernel's page cache — the same cache that holds file data that's been read from disk, or is waiting to be written to disk. A page in the page cache that's different from what's on disk (because it's been written to) is referred to as "dirty".

    There is a kernel thread that scans for dirty pages and writes them back to disk, under the control of several parameters. One important one is dirty_expire_centisecs. If any of the pages for a file have been dirty for longer than dirty_expire_centisecs then all of the dirty pages for that file will get written out. The default value is 3000 centisecs (30 seconds).

    Another set of variables is dirty_writeback_centisecs, dirty_background_ratio, and dirty_ratio. dirty_writeback_centisecs controls how often the kernel thread checks for dirty pages, and defaults to 500 (5 seconds). If the percentage of dirty pages (as a fraction of the memory available for caching) is less than dirty_background_ratio then nothing happens; if it's more than dirty_background_ratio, then the kernel will start writing some pages to disk. Finally, if the percentage of dirty pages exceeds dirty_ratio, then any processes attempting to write will block until the amount of dirty data decreases. This ensures that the amount of unwritten data can't increase without bound; eventually, processes producing data faster than the disk can write it will have to slow down to match the disk's pace.
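To see which values are in effect on a given machine, the four parameters above can be read from /proc/sys/vm on Linux. The following is a minimal sketch; the helper name `read_dirty_knobs` is made up for illustration, and a parameter that cannot be read is reported as `None` rather than guessed.

```python
from pathlib import Path

# Location of the VM writeback tunables on Linux.
VM_DIR = Path("/proc/sys/vm")

# The four parameters discussed in the text above.
KNOBS = (
    "dirty_expire_centisecs",
    "dirty_writeback_centisecs",
    "dirty_background_ratio",
    "dirty_ratio",
)


def read_dirty_knobs(vm_dir=VM_DIR):
    """Return {parameter name: int value, or None if it could not be read}."""
    values = {}
    for name in KNOBS:
        try:
            values[name] = int((vm_dir / name).read_text().strip())
        except (OSError, ValueError):
            # Not Linux, no permission, or unexpected content.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_dirty_knobs().items():
        print(f"{name} = {value}")
```

Changing a value normally requires root, for example by writing a new number to the corresponding file under /proc/sys/vm, and the change applies system-wide rather than per file.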

    The question of how the mtime gets updated is related to the question of how the kernel knows that a page is dirty in the first place. In the case of mmap, the answer is that the kernel sets the pages of the mapping to read-only. That doesn't mean that you can't write them, but it means that the first time you do, it triggers an exception in the memory-management unit, which is handled by the kernel. The exception handler does (at least) four things:

    1. Marks the page as dirty, so that it will get written back.
    2. Updates the file mtime.
    3. Marks the page as read-write, so that the write can succeed.
    4. Jumps back to the instruction in your program that writes to the mmapped page, which succeeds this time.

    So when you write data to a clean page, it causes an mtime update, but it also causes the page to become read-write, so that further writes don't cause an exception (or an mtime update) (see Note 1). However, when the dirty page gets flushed to disk, it becomes clean, and also becomes "read-only" again, so that any further writes to it will trigger another eventual disk write, and also another mtime update.
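One way to observe this on your own system is to record the file's mtime around two consecutive stores to the same mapped page. This is a rough experiment rather than a proof: the function names are hypothetical, the scratch file lives wherever `tempfile` puts it (which may not be XFS), and what you see depends on the filesystem and its timestamp granularity.

```python
import mmap
import os
import tempfile
import time


def mtime_around_mmap_writes(path):
    """Return st_mtime_ns before, after a first store, and after a second store."""
    with open(path, "r+b") as f:
        # On Unix, mmap.mmap() creates a shared, read-write mapping by default.
        with mmap.mmap(f.fileno(), 0) as m:
            before = os.stat(path).st_mtime_ns
            time.sleep(0.05)
            m[0:1] = b"x"  # first store to a clean page
            after_first = os.stat(path).st_mtime_ns
            time.sleep(0.05)
            m[1:2] = b"y"  # second store to the same, now dirty, page
            after_second = os.stat(path).st_mtime_ns
    return before, after_first, after_second


def demo():
    """Run the experiment on a one-page scratch file and clean up afterwards."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * mmap.PAGESIZE)
        return mtime_around_mmap_writes(path)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    before, first, second = demo()
    print("first store changed mtime: ", first != before)
    print("second store changed mtime:", second != first)
```

If the description above applies to your filesystem, the first store would be expected to change the mtime and the second would not, unless the page happened to be written back in between.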

    So now, with a few assumptions, we can start to piece together the puzzle.

    First, dirty_background_ratio and dirty_ratio are probably not coming into play. If the pace of your writes was fast enough to trigger background flushes, then most likely you would see the "irregular" behavior on all files.

    Second, the difference between the "irregular" files and the "30 second" files is the page access pattern. I surmise that the "irregular" files are being written to in some sort of append-mode or circular-buffer fashion, such that you start writing to a new page every few seconds. Every time you dirty a previously untouched page, it triggers an mtime update. But for the files displaying the 30-second pattern, you only write to one page (perhaps they are one page or less in length). In that case, the mtime is updated on the first write, and then not again until the file has been flushed to disk after its dirty page aged past dirty_expire_centisecs (30 seconds by default).
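The two access patterns can be compared with a toy model of the mechanism described above. This is only a sketch under simplifying assumptions that are mine, not the kernel's: time advances in whole seconds, there is one write per second, the flusher wakes every 5 seconds and writes back all of the file's dirty pages once the oldest has been dirty for 30 seconds, and the page layout of the "irregular" file (four pages, moving on every 7 seconds) is invented for illustration.

```python
def simulate(page_for_second, duration_s, expire_s=30, writeback_s=5):
    """Return the times (in seconds) at which the modelled mtime changes."""
    dirty_since = {}    # page index -> time of the write that dirtied it
    mtime_updates = []
    for t in range(duration_s):
        # Periodic flusher: once the oldest dirty page has reached expire_s,
        # write back every dirty page of the file (all become clean again).
        if t > 0 and t % writeback_s == 0 and dirty_since:
            if t - min(dirty_since.values()) >= expire_s:
                dirty_since.clear()
        # One write per second. Only a write to a clean page faults,
        # and only that fault updates the mtime.
        page = page_for_second(t)
        if page not in dirty_since:
            dirty_since[page] = t
            mtime_updates.append(t)
    return mtime_updates


def gaps(times):
    """Differences between consecutive mtime updates."""
    return [b - a for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    one_page = simulate(lambda t: 0, 100)              # always the same page
    ring = simulate(lambda t: (t // 7) % 4, 100)       # hypothetical ring buffer
    print("one-page file gaps:   ", gaps(one_page))
    print("ring-buffer file gaps:", gaps(ring))
```

In this model the one-page file only gets a new mtime after each flush, while the ring-buffer file also gets one whenever the writer moves onto a clean page, which produces shorter and uneven gaps.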

    Note 1: This behavior is, technically, wrong. The standards allow some unpredictability in when the mtime is updated, but they do require that it be set sometime at or after the last write to the file, and at or before an msync (if any). When a page is written to multiple times in the interval before it's flushed to disk, that requirement isn't met: the mtime keeps the timestamp of the first write. This has been discussed, but a patch that would have fixed it wasn't accepted. Therefore, when using mmap, mtimes can be stale. dirty_expire_centisecs limits that error only partially, since other disk traffic might delay the flush, extending the window in which writes bypass the mtime update even further.