Tags: io, virtual-memory, memory-mapped-files, mmu

Understanding memory mapping conceptually


I've already asked this question on cs.stackexchange.com, but decided to post it here as well.

I've read several blogs and questions on Stack Exchange, but I'm unable to grasp what the real drawbacks of memory mapped files are. I see the following are frequently listed:

  1. You can't memory map large files (>4GB) with a 32-bit address space. This makes sense to me now.

  2. One drawback that I thought of was that if too many files are memory mapped, this can reduce available system resources (memory) => can cause pages to be evicted => potentially more page faults. So some prudence is required in deciding what files to memory map and their access patterns.

  3. Overhead of kernel mappings and data structures - according to Linus Torvalds. I won't even attempt to question this premise, because I don't know much about the internals of Linux kernel. :)

  4. If the application is trying to read from a part of the file that is not loaded in the page cache, it (the application) will incur a penalty in the form of a page-fault, which in turn means increased I/O latency for the operation.

QUESTION #1: Isn't this the case for a standard file I/O operation as well? If an application tries to read from a part of a file that is not yet cached, it will result in a syscall that will cause the kernel to load the relevant page/block from the device. And on top of that, the page needs to be copied back to the user-space buffer.

Is the concern here that page-faults are somehow more expensive than syscalls in general - my interpretation of what Linus Torvalds says here? Is it because page-faults are blocking => the thread is not scheduled off the CPU => we are wasting precious time? Or is there something I'm missing here?

  5. No support for async I/O for memory mapped files.

QUESTION #2: Is there an architectural limitation with supporting async I/O for memory mapped files, or is it just that no one got around to doing it?

QUESTION #3: Vaguely related, but my interpretation of this article is that the kernel can read-ahead for standard I/O (even without fadvise()) but does not read-ahead for memory mapped files (unless issued an advisory with madvise()). Is this accurate? If this statement is in fact true, is that why syscalls for standard I/O may be faster, as opposed to a memory mapped file which will almost always cause a page-fault?


Solution

  • QUESTION #1: Isn't this the case for a standard file I/O operation as well? If an application tries to read from a part of a file that is not yet cached, it will result in a syscall that will cause the kernel to load the relevant page/block from the device. And on top of that, the page needs to be copied back to the user-space buffer.

    You do the read into a buffer and the I/O device copies the data there. There are also async reads, or AIO, where the kernel transfers the data in the background as the device provides it; you can do the same thing with threads and read. In the mmap case you have no control over, and no way of knowing, whether a page is mapped or not. The read case is more explicit. This follows from,

    ssize_t read(int fd, void *buf, size_t count);
    

    You specify a buf and a count, so you can explicitly place the data where you want it in your program. As a programmer, you may know that the data will not be used again, so subsequent calls to read can reuse the same buf from the last call. This has multiple benefits; the easiest to see is lower memory use (or at least less address space and fewer MMU table entries). mmap does not know whether a page will be accessed again in the future, nor that only some of the data in a page was of interest. Hence, read is more explicit.
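
    The buffer reuse described above might look like the following minimal sketch (the chunk size and the file argument are illustrative, error handling is trimmed):

        #include <fcntl.h>
        #include <stdio.h>
        #include <unistd.h>

        #define CHUNK 4096

        int main(int argc, char *argv[])
        {
            if (argc < 2)
                return 1;
            int fd = open(argv[1], O_RDONLY);
            if (fd < 0)
                return 1;

            static char buf[CHUNK];   /* one buffer, reused for every read */
            ssize_t n;
            while ((n = read(fd, buf, sizeof buf)) > 0) {
                /* process buf[0..n-1]; the next read overwrites the same
                 * memory, so the working set stays at one buffer and one
                 * small set of MMU entries */
            }
            close(fd);
            return 0;
        }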

    Imagine you have 4096 records of size 4095 bytes on a disk. You need to read/look at two random records and perform an operation on them. For read, you can allocate two 4095-byte buffers with malloc() or use a static char buffer[2][4095]. With mmap(), each record straddles two 4K pages on average, so about 8192 bytes must be brought in per record, or 16k total. Accessing each mmap record therefore results in two page faults per record, and the kernel must allocate four TLB/MMU pages to hold the data.

    Alternatively, if the records are read into two adjacent buffers, only two pages are needed, at the cost of only two syscalls (read). Also, if the computation on the records is extensive, the locality of the buffers will make it much faster (CPU cache hits) than working on the mmap data.
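
    A hedged sketch of the two approaches in the record example above (pread() is used here for the random offsets; record indices and error handling are simplified):

        #include <fcntl.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define RECSZ 4095

        /* read path: each record lands in its own contiguous buffer,
         * regardless of where it sits in the file (two pages total). */
        static void with_read(int fd, size_t rec_a, size_t rec_b,
                              char out[2][RECSZ])
        {
            pread(fd, out[0], RECSZ, (off_t)(rec_a * RECSZ));
            pread(fd, out[1], RECSZ, (off_t)(rec_b * RECSZ));
        }

        /* mmap path: a 4095-byte record usually straddles two 4K pages,
         * so touching each record faults in ~8K and costs extra MMU/TLB
         * entries (four pages total for the two records). */
        static void with_mmap(int fd, size_t filesz, size_t rec_a,
                              size_t rec_b, char out[2][RECSZ])
        {
            char *base = mmap(NULL, filesz, PROT_READ, MAP_PRIVATE, fd, 0);
            if (base == MAP_FAILED)
                return;
            memcpy(out[0], base + rec_a * RECSZ, RECSZ);  /* page faults here */
            memcpy(out[1], base + rec_b * RECSZ, RECSZ);  /* and here */
            munmap(base, filesz);
        }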

    And on top of that, the page needs to be copied back to the user-space buffer.

    This copy may not be as bad as you believe. The CPU will cache data so that the next access doesn't have to reload from main memory, which can be 100x slower than L1 CPU cache.

    In the case above, mmap can take over two times as long as a read.

    Is the concern here that page-faults are somehow more expensive than syscalls in general - my interpretation of what Linus Torvalds says here? Is it because page-faults are blocking => the thread is not scheduled off the CPU => we are wasting precious time? Or is there something I'm missing here?

    I think the main point is that you don't have control with mmap. You mmap the file and have no idea if any part of it is in memory or not. If you just randomly access the file, it will keep being read back from disk, and you may get thrashing, depending on the access pattern, without knowing it. If the access is purely sequential, then read may not seem better at first glance. However, by re-reading each new chunk into the same user buffer, the L1/L2 CPU cache and the TLB will be better utilized, both for your process and for others in the system. If you read all chunks into unique buffers and process them sequentially, then the two will be about the same (see note below).

    QUESTION #2: Is there an architectural limitation with supporting async I/O for memory mapped files, or is it just that no one got around to doing it?

    mmap is already similar to AIO, but it works in fixed sizes of 4k. I.e., the full mmap file doesn't need to be in memory to start operating on it. Functionally, they are different mechanisms to get a similar effect. They are architecturally different.
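
    For comparison, the explicit asynchronous path that exists for buffer-based reads is POSIX AIO; a rough sketch (the file name is hypothetical, the busy-wait is only for brevity, link with -lrt on Linux):

        #include <aio.h>
        #include <errno.h>
        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
            int fd = open("records.dat", O_RDONLY);   /* hypothetical file */
            if (fd < 0)
                return 1;

            static char buf[4096];
            struct aiocb cb;
            memset(&cb, 0, sizeof cb);
            cb.aio_fildes = fd;
            cb.aio_buf    = buf;
            cb.aio_nbytes = sizeof buf;
            cb.aio_offset = 0;

            aio_read(&cb);            /* data is filled in the background */

            /* ... do other work here ... */

            while (aio_error(&cb) == EINPROGRESS)
                ;                     /* poll; a real program would not spin */

            ssize_t n = aio_return(&cb);
            printf("read %zd bytes asynchronously\n", n);
            close(fd);
            return 0;
        }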

    QUESTION #3: Vaguely related, but my interpretation of this article is that the kernel can read-ahead for standard I/O (even without fadvise()) but does not read-ahead for memory mapped files (unless issued an advisory with madvise()). Is this accurate? If this statement is in fact true, is that why syscalls for standard I/O may be faster, as opposed to a memory mapped file which will almost always cause a page-fault?

    Poor programming of read can be just as bad as mmap. mmap can use madvise. It is more related to all the Linux MM stuff that has to happen to make mmap work. It all depends on your use case; either can work better depending on the access patterns. I think that Linus was just saying that neither is a magic bullet.
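
    As a sketch of the two advisory calls mentioned here, assuming a file descriptor and length supplied by the caller:

        #include <fcntl.h>
        #include <sys/mman.h>

        /* Buffered-read path: tell the kernel the access will be sequential
         * so it can read ahead aggressively. */
        static void hint_read_path(int fd, off_t len)
        {
            posix_fadvise(fd, 0, len, POSIX_FADV_SEQUENTIAL);
        }

        /* mmap path: the equivalent hint goes through madvise(); use
         * MADV_RANDOM instead to disable read-ahead for random access. */
        static void *map_with_hint(int fd, size_t len)
        {
            void *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED)
                madvise(p, len, MADV_SEQUENTIAL);
            return p;
        }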

    For instance, if you read into a buffer that is bigger than the memory the system has, and you rely on swap, which does the same sort of thing as mmap, you will be worse off. You may have a system without swap, where mmap for random read access will be fine and will allow you to manage files bigger than actual memory. Doing this setup with read requires a lot more code, which often means more bugs, or if you are naive you will just get an OOM kill message. Note: if the access is sequential, however, read is not as much code and it will probably be faster than mmap.
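
    A rough sketch of the mmap side of that trade-off, randomly sampling a file that may be larger than RAM (the sampling stride is arbitrary):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>

        int main(int argc, char *argv[])
        {
            if (argc < 2)
                return 1;
            int fd = open(argv[1], O_RDONLY);
            struct stat st;
            if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
                return 1;

            /* Map the whole file; only the pages actually touched are
             * brought into memory, and clean file pages can be evicted
             * under pressure even on a system with no swap. */
            char *base = mmap(NULL, (size_t)st.st_size, PROT_READ,
                              MAP_PRIVATE, fd, 0);
            if (base == MAP_FAILED)
                return 1;

            unsigned long sum = 0;
            for (off_t off = 0; off < st.st_size; off += st.st_size / 8 + 1)
                sum += (unsigned char)base[off];  /* each touch may fault */

            printf("sampled sum: %lu\n", sum);
            munmap(base, (size_t)st.st_size);
            close(fd);
            return 0;
        }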


    Additional read benefits

    For some use cases, read offers the use of sockets and pipes, and char devices such as ttyS0 will only work with read. This can be beneficial if you author a command line program that gets file names from the command line: if you structure the code around mmap, it may be difficult to support these files.
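
    A small illustration of that flexibility, assuming the descriptor is whatever the shell connects to stdin (a regular file, a pipe, or a terminal):

        #include <stdio.h>
        #include <unistd.h>

        /* Count bytes from any readable descriptor: regular file, pipe,
         * socket, or a char device such as /dev/ttyS0. None of these need
         * to be mappable, so the same loop covers them all. */
        static long count_bytes(int fd)
        {
            char buf[4096];
            long total = 0;
            ssize_t n;
            while ((n = read(fd, buf, sizeof buf)) > 0)
                total += n;
            return n < 0 ? -1 : total;
        }

        int main(void)
        {
            /* e.g.  ./count < somefile   or   somecmd | ./count */
            long total = count_bytes(STDIN_FILENO);
            if (total < 0)
                return 1;
            printf("%ld bytes\n", total);
            return 0;
        }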