Tags: c++, boost, file-mapping

IO from a mapped file vs IO using filestreams


I am working on an application which needs to deal with large amounts of data (in the order of GBs). I don't need all of the data at once at any moment in time. It is OK to section the data and work on (and thus bring into memory) only one section at any given instance.

I have read that most applications which need to manipulate large amounts of data usually do so by making use of memory mapped files. Reading further about memory mapped files, I found that reading/writing data from/into memory mapped files is faster than normal file IO, because we end up using the operating system's highly optimized paging algorithms to perform the reads and writes.

Here are the queries that I have:

  1. How different is using memory mapped files (I am planning to use boost::file_mapping, and I am working on Windows) for file IO from using file streams?
  2. How much faster can I expect the data reads/writes to be with memory mapped files, compared to using file streams (on a traditional 7200 rpm hard disk)?
  3. Are memory mapped files the only way to deal with such huge amounts of data? Are there better ways of doing this (considering my use case)?

Solution

  • (Disclaimer: I am the author of proposed Boost.AFIO)

    How different is using memory mapped files (I am planning to use boost::file_mapping, and I am working on Windows) for file IO from using file streams?

    Grossly simplified answer:

    Memory mapped files do reads lazily in 4KiB chunks, i.e. when you first access that 4KiB page. File streams do the read when you ask for the data.

    More accurate answer:

    Memory mapped files give you direct access to the kernel page cache for file i/o. You see exactly what the kernel keeps cached for some open file. Reads and writes are directly to the kernel page cache - one can go no faster for buffered i/o.
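    The difference can be sketched in code. boost::interprocess::file_mapping and mapped_region wrap the same platform primitives (mmap on POSIX, CreateFileMapping/MapViewOfFile on Windows); a plain POSIX mmap is used below to keep the sketch dependency-free, and the file name is illustrative:

    ```cpp
    #include <cassert>
    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main() {
        const char* path = "demo.bin";  // hypothetical scratch file
        const std::string payload = "hello, page cache";

        // Create a small test file.
        { std::ofstream out(path, std::ios::binary); out << payload; }

        // Stream read: an explicit read that copies into our buffer now.
        std::vector<char> streamed(payload.size());
        std::ifstream in(path, std::ios::binary);
        in.read(streamed.data(), streamed.size());

        // Mapped read: no copy is issued here; pages are faulted in lazily,
        // one page at a time, the first time each page is touched.
        int fd = open(path, O_RDONLY);
        assert(fd != -1);
        struct stat st{};
        assert(fstat(fd, &st) == 0);
        const char* mapped = static_cast<const char*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
        assert(mapped != MAP_FAILED);

        // Both paths see the same bytes; the mapped pointer is a direct
        // window onto the kernel page cache, with no intermediate buffer.
        assert(std::memcmp(mapped, streamed.data(), payload.size()) == 0);

        munmap(const_cast<char*>(mapped), st.st_size);
        close(fd);
        unlink(path);
        std::cout << "ok\n";
    }
    ```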

    How much faster can I expect the data reads/writes to be with memory mapped files, compared to using file streams (on a traditional 7200 rpm hard disk)?

    Probably not noticeable. If you benchmark a difference, it's likely confounding factors like differing caching algorithms. A hard drive is so slow it'll always be the dominant factor.

    Now if you were really asking how efficient the two are from the point of view of load on the system, then memory mapped files are likely to be far more efficient. STL iostreams copy memory at least once, plus on Windows most "immediate" i/o is really a memcpy from a small internal memory map configured by the Windows kernel for your process, so that's two memory copies of everything you read, minimum.

    The most efficient of all is always O_DIRECT/FILE_FLAG_NO_BUFFERING with all the gotchas that come with it, but it is very rare you'll write a caching algorithm much better than the operating system's. They have, after all, spent decades tuning their algorithms.
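    A hedged sketch of those gotchas, using the POSIX O_DIRECT flag (the Windows analogue is CreateFile with FILE_FLAG_NO_BUFFERING): the buffer address, file offset, and transfer size must all be multiples of the block size, which is assumed to be 4096 here:

    ```cpp
    #include <cassert>
    #include <cstdio>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>

    #include <fcntl.h>
    #include <unistd.h>

    int main() {
        const char* path = "direct_demo.bin";  // hypothetical scratch file
        const size_t kBlock = 4096;            // assumed block size

        // Write one block of data through the normal buffered path.
        { std::ofstream out(path, std::ios::binary);
          out << std::string(kBlock, 'x'); }

        // O_DIRECT bypasses the kernel page cache entirely.
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd == -1) {
            // Some filesystems (e.g. tmpfs) reject O_DIRECT outright.
            std::remove(path);
            std::cout << "O_DIRECT unsupported here, ok\n";
            return 0;
        }

        // The buffer itself must be block-aligned; plain new/malloc
        // gives no such guarantee.
        void* buf = nullptr;
        assert(posix_memalign(&buf, kBlock, kBlock) == 0);

        // The transfer length must also be a block multiple.
        ssize_t got = read(fd, buf, kBlock);
        assert(got == static_cast<ssize_t>(kBlock));

        free(buf);
        close(fd);
        std::remove(path);
        std::cout << "ok\n";
    }
    ```

    Get any of the three alignments wrong and the read fails with EINVAL, which is exactly why unbuffered i/o is usually reserved for code that implements its own cache.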

    Are memory mapped files the only way to deal with such huge amounts of data? Are there better ways of doing this (considering my use case)?

    Memory mapped files let the kernel cache a very large dataset for you using general purpose caching algorithms which make use of all the free memory in your system. Generally speaking, you will not beat them with your own algorithms for most use cases.
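    Your "work on one section at a time" use case maps directly onto windowed mappings: map only the slice of the file you are working on, so a GB-scale file never has to fit in your address space at once. A POSIX sketch (boost::interprocess::mapped_region takes an analogous offset/length pair); note the offset must be page-aligned, and the file name and sizes are illustrative:

    ```cpp
    #include <cassert>
    #include <fstream>
    #include <iostream>
    #include <string>

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main() {
        const char* path = "big_demo.bin";        // hypothetical data file
        const long page = sysconf(_SC_PAGE_SIZE); // typically 4096

        // Build a file spanning several pages, each tagged with its index.
        {
            std::ofstream out(path, std::ios::binary);
            for (int i = 0; i < 8; ++i)
                out << std::string(page, static_cast<char>('0' + i));
        }

        // Map just the third page; the rest of the file stays out of our
        // address space entirely.
        int fd = open(path, O_RDONLY);
        assert(fd != -1);
        off_t offset = 2 * page;                  // must be page-aligned
        const char* window = static_cast<const char*>(
            mmap(nullptr, page, PROT_READ, MAP_PRIVATE, fd, offset));
        assert(window != MAP_FAILED);

        // The window shows exactly the requested section of the file.
        assert(window[0] == '2' && window[page - 1] == '2');

        munmap(const_cast<char*>(window), page);
        close(fd);
        unlink(path);
        std::cout << "ok\n";
    }
    ```

    To walk through the whole dataset, unmap one window and map the next; the kernel keeps recently touched pages cached, so revisiting a section is usually cheap.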