Tags: c++, memory, memory-management, out-of-memory, memory-mapped-files

Using memory-mapped files for in-program temporary arrays?


I'm currently writing a program that has to handle out-of-core data: it processes files ranging from about 1 MB up to 50 GB (and possibly larger in the future).

I have read several tutorials on memory-mapped files and am now using them to manage data I/O, i.e. reading and writing data from/to the hard drive.

While processing the data I also need some temporary arrays of the same size as the data itself. My question is whether I should use memory-mapped files for these as well, or whether I should let the OS manage the memory without explicitly defining memory-mapped files. The problem is as follows:

I'm working on multiple platforms, but always on 64-bit systems. In theory, the 64-bit virtual address space is definitely sufficient for my needs. However, on Windows the maximum virtual address space seems to be limited by the operating system: a user can configure whether paging is allowed and how large the maximum virtual memory may be. I also read somewhere that the usable virtual address space on 64-bit Windows isn't 2^64 bytes but somewhere around 2^40 or similar, which would still be sufficient for me but seems like a rather odd limitation. Furthermore, Windows has some strange limitations such as arrays with a maximum of 2^31 elements, independent of the element type. I don't know how all of this is handled on Linux, but I assume it is treated similarly. Probably the maximum allowed virtual memory is RAM plus swap partition size? So there are a lot of things to struggle with if I want to let the system handle data exceeding the RAM size. I don't even know whether I can use the entire 64-bit virtual address space from C++. In a short test I got a compiler error when trying to declare an array with more than 2^31 elements, but I think it is easy to go beyond that by using std::vector and the like.
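For illustration, a minimal sketch (not from the original post, sizes are just examples) of that last point: a fixed-size array is rejected at compile time or is far too large for the stack, whereas a heap-allocated std::vector indexed with a 64-bit size_t can grow past 2^31 elements on a 64-bit system, subject to available RAM plus page file.

    #include <cstdint>
    #include <vector>

    int main() {
        // double big[3000000000];      // fixed-size array: rejected by the
                                        // compiler or far too large for the stack

        std::vector<double> big;        // heap allocation, sized at runtime
        big.resize(3000000000ull);      // ~22 GiB of virtual memory; succeeds on a
                                        // 64-bit system only if RAM + page file allow it
        big[2999999999ull] = 1.0;       // indexing past 2^31 elements works
        return 0;
    }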

On the other hand, with a memory-mapped file every memory write would eventually be written to the HDD. Especially for data sets smaller than my physical RAM, this could be a significant bottleneck. Or is writing deferred until the RAM is exhausted? The usual advantages of memory-mapped files are inter-process communication via shared memory, or persistence across runs (start the application, write something, quit, restart later and efficiently read back into RAM only the data that is needed). Since I process the entire data set within a single run of a single process, neither advantage applies in my case.

Note: a streaming approach is not really a feasible alternative, as I heavily depend on random access to the data.

Ideally, I would like to be able to process models of any size, independent of limits set by the operating system, keeping everything that fits in RAM and, only once the physical limit is exceeded, falling back to memory-mapped files or other mechanisms (if there are any) to page out the excess data, ideally managed by the operating system.

To conclude: what is the best approach to handle this temporary data? If it can be done without memory-mapped files and in a platform-independent way, could you give me a code snippet or similar and explain how it avoids these OS limitations?


Solution

  • Maybe a bit late, but it's an interesting question.

    On the other hand, with a memory-mapped file every memory write would eventually be written to the HDD. Especially for data sets smaller than my physical RAM, this could be a significant bottleneck. Or is writing deferred until the RAM is exhausted?

    To avoid writing to disk while there is enough memory, open the file as 'temporary' (FILE_ATTRIBUTE_TEMPORARY) together with FILE_FLAG_DELETE_ON_CLOSE. This hints to the OS that it should delay writing the pages to disk for as long as possible.
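    As a minimal, Windows-specific sketch of that idea (error handling omitted; the path and sizes are just placeholders): the temporary backing file is created, grown via the file-mapping object, and mapped into the 64-bit address space.

        #include <windows.h>
        #include <cstdint>

        int main() {
            // Open the scratch file as temporary + delete-on-close so the cache
            // manager is hinted to keep dirty pages in RAM instead of flushing them.
            HANDLE file = CreateFileW(L"C:\\temp\\scratch.tmp",
                                      GENERIC_READ | GENERIC_WRITE,
                                      0, nullptr, CREATE_ALWAYS,
                                      FILE_ATTRIBUTE_TEMPORARY | FILE_FLAG_DELETE_ON_CLOSE,
                                      nullptr);

            const uint64_t size = 4ull << 30;   // e.g. 4 GiB of scratch space

            // The mapping object grows the file to the requested size.
            HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                                static_cast<DWORD>(size >> 32),
                                                static_cast<DWORD>(size & 0xFFFFFFFFu),
                                                nullptr);

            // Map the whole file and use it like a plain array.
            double* data = static_cast<double*>(
                MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE, 0, 0, 0));

            data[0] = 42.0;                     // ... process the data ...

            UnmapViewOfFile(data);
            CloseHandle(mapping);
            CloseHandle(file);                  // DELETE_ON_CLOSE removes the file here
            return 0;
        }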

    As for the limitations on array size: it is probably best to provide your own data structures on top of the mapped views. For big data sets you may want to use several smaller mapped views, which you can map and unmap as needed; a sketch of such a windowed view follows.
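    A rough sketch of one such window (again Windows-specific; the mapping handle, offset and window size are placeholders). View offsets must be a multiple of the system allocation granularity, which GetSystemInfo reports.

        #include <windows.h>
        #include <cstdint>

        // Map a window of 'windowSize' bytes starting at byte position 'offset'
        // of the big mapping, do some work on it, then unmap it again.
        void processWindow(HANDLE mapping, uint64_t offset, size_t windowSize) {
            SYSTEM_INFO si;
            GetSystemInfo(&si);
            const uint64_t granularity = si.dwAllocationGranularity;  // typically 64 KiB

            // View offsets must be aligned to the allocation granularity.
            const uint64_t alignedOffset = offset - (offset % granularity);
            const size_t   slack         = static_cast<size_t>(offset - alignedOffset);

            void* view = MapViewOfFile(mapping, FILE_MAP_READ | FILE_MAP_WRITE,
                                       static_cast<DWORD>(alignedOffset >> 32),
                                       static_cast<DWORD>(alignedOffset & 0xFFFFFFFFu),
                                       windowSize + slack);

            char* bytes = static_cast<char*>(view) + slack;
            // ... random access within [bytes, bytes + windowSize) ...

            UnmapViewOfFile(view);   // release the window before mapping the next one
        }

    One could wrap this in a small class that keeps a handful of windows mapped at once and unmaps the least recently used one when a new region is requested.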