Tags: c++, c, memory-management, mmap, large-data

Do memory mapped files provide advantage for large buffers?


My program works with large data sets that need to be stored in contiguous memory (several gigabytes). Allocating memory with std::allocator (i.e. malloc or new) causes system stalls, as large portions of virtual memory are reserved and physical memory fills up.

Since the program will mostly work on only small portions at a time, my question is whether memory-mapped files (i.e. mmap or the Windows equivalent) would provide an advantage. That is, creating a large sparse temporary file and mapping it into virtual memory. Or is there another technique that would change the system's paging strategy so that fewer pages are loaded into physical memory at a time?

I'm trying to avoid building a streaming mechanism that loads portions of a file at a time, and instead rely on the system's VM paging.


Solution

  • Yes, mmap has the potential to speed things up.

    Things to consider:

    • Remember the VMM will page things in and out in page-sized blocks (4 KiB on Linux)
    • If your memory access is well localised over time, this will work well. But if your access pattern is random across the entire file, you will still end up with a lot of seeking and thrashing. So consider whether your 'small portions' correspond to localised regions of the file.
    • For large allocations, malloc and free will use mmap with MAP_ANON anyway. So the difference in memory mapping a file is simply that you are getting the VMM to do the I/O for you.
    • Consider using madvise with mmap to assist the VMM in paging well.
    • When you use open and read (plus, as erenon suggests, posix_fadvise), your file still goes through kernel buffers anyway (i.e. data is not written out immediately) unless you also use O_DIRECT. So in both situations you are relying on the kernel for I/O scheduling.