Tags: c++, algorithm, performance, file-io, memory-mapped-files

How to use file input/output functions efficiently on large files (using a limited amount of memory)


There is an algorithm I want to implement in C++ that involves a lot of file I/O. Although I have implemented similar things on smaller scales, this time I need to work on files of several GBs. I know there are new things to consider when the file size is greater than the available memory, and I should also be concerned about the cost.

My plan is to find out how much memory I can allocate, use it to read a predetermined portion of the file, and save the results to a text file for each pass. However, I will need to read and modify that text file line by line after each pass to update it, since it will represent a linked list (byte blocks will correspond to nodes).

Is it efficient to keep the results of those passes in a text file and update it line by line on every pass? I would appreciate it if you could let me know of any change that would make the algorithm more efficient. I would also appreciate some short/quick examples, since I have never used file input/output beyond "read this entire file" / "write this as an entire file" type commands.

Edit: The operating systems are Linux and macOS.

There are many byte segments repeating inside a binary file, and I want to count how many times certain combinations repeat and sort them by that count. For example, if a binary file is 111111100000001110101010100000111, I will count the number of occurrences of some predetermined patterns such as 110111001010, 10101011, etc., and sort them. The minimum file size I expect is 1 GB and the maximum is around 10-20 GB. I will look for approximately 1,000,000,000 patterns and sort them all. So I thought that since I need to update the output file every time my buffer is full, I might as well make it a linked list and update the list (should be ~O(n)) to avoid doing a quicksort (should be ~n log(n)) at the end.


Solution

  • Here's an efficient way to do this:

    Open your source file and access your data with mmap(). This way you are accessing the OS disk cache directly and you eliminate copying the memory from kernel mode to user mode. If your files are really big, it is best to use smaller mmapped views to prevent the creation of large page tables.
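
    A minimal sketch of that windowed approach, assuming POSIX mmap() on Linux/macOS (the file name and the 256 MiB window size are placeholders, not part of the original answer):

        // Map the input in fixed-size windows instead of all at once.
        // Note: a pattern straddling a window boundary would need extra
        // handling (e.g. overlapping the windows slightly).
        #include <fcntl.h>
        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <unistd.h>
        #include <cstddef>
        #include <cstdio>

        int main() {
            const char* path = "input.bin";             // hypothetical input file
            const size_t window = 256 * 1024 * 1024;    // 256 MiB view (page-aligned)

            int fd = open(path, O_RDONLY);
            if (fd < 0) { perror("open"); return 1; }

            struct stat st;
            if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
            const size_t fileSize = static_cast<size_t>(st.st_size);

            for (size_t off = 0; off < fileSize; off += window) {
                const size_t len = (fileSize - off < window) ? fileSize - off : window;
                void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd,
                               static_cast<off_t>(off));
                if (p == MAP_FAILED) { perror("mmap"); return 1; }
                madvise(p, len, MADV_SEQUENTIAL);       // hint: front-to-back scan

                const unsigned char* bytes = static_cast<const unsigned char*>(p);
                // ... scan bytes[0 .. len) for your patterns here ...
                (void)bytes;

                munmap(p, len);                         // release the view before the next one
            }
            close(fd);
            return 0;
        }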

    Depending on the number of distinct patterns you are using, you have the following options:

    If the number of patterns is small enough to fit in memory:

    • If the values are sparse: store them in a map of pattern/count pairs.
    • If the values are somewhat continuous: store the counts in a vector, where the position is the value of your pattern, offset if needed. Both options are sketched below.
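
    A rough sketch of both in-memory layouts (the 32-bit pattern type and the 24-bit width in the dense case are assumptions for illustration only):

        #include <cstdint>
        #include <unordered_map>
        #include <vector>

        // Option 1 - sparse pattern values: hash map of pattern -> count.
        std::unordered_map<uint32_t, uint64_t> sparseCounts;

        void countSparse(uint32_t pattern) {
            ++sparseCounts[pattern];       // inserts 0 on first sight, then increments
        }

        // Option 2 - roughly continuous pattern values: index a vector directly,
        // so the position *is* the pattern value (minus an offset if needed).
        std::vector<uint64_t> denseCounts(1u << 24, 0);   // one slot per 24-bit pattern

        void countDense(uint32_t pattern) {
            ++denseCounts[pattern];        // assumes pattern < 2^24
        }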

    If the number of patterns can get big:

    (you're talking about 1 billion patterns, though it depends on how unique they are): you could create an mmapped output file and store the counts there, but make sure that all the values (or pairs) have the same width, i.e. store everything in binary (you can then use the file just as you would use an array).

    If most of the values are distinct, store them at the position of your pattern value - for example, if pattern (32-bit?) + count is 8 bytes, store them at position pattern-value * 8 for quick access. In case there are large gaps in your pattern values, but you want to avoid inserting and moving data, consider using a (temporary) sparse file to store the values directly at the right position.

    If you only need a count, you could store just the counts (32-bit) at their specific positions, but if you need to sort, you'll also need the pattern values somehow.
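
    As a rough sketch of the last two paragraphs, here is the counts-only variant as an mmapped file indexed by pattern value (assumptions: 32-bit patterns, 64-bit counts, a 64-bit address space, and a filesystem that supports sparse files; the file name is a placeholder):

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>
        #include <cstdint>
        #include <cstdio>

        int main() {
            const char* path = "counts.bin";                   // hypothetical output file
            const uint64_t slots = 1ull << 32;                 // one slot per 32-bit pattern value
            const uint64_t bytes = slots * sizeof(uint64_t);   // 32 GiB of address space,
                                                               // sparse on disk until touched
            int fd = open(path, O_RDWR | O_CREAT, 0644);
            if (fd < 0) { perror("open"); return 1; }
            if (ftruncate(fd, static_cast<off_t>(bytes)) < 0) { perror("ftruncate"); return 1; }

            void* mem = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            if (mem == MAP_FAILED) { perror("mmap"); return 1; }
            uint64_t* counts = static_cast<uint64_t*>(mem);

            // Counting is now just array indexing; the kernel pages the touched
            // parts of the file in and out for you.
            uint32_t somePattern = 0x00ABCDEF;                 // placeholder pattern value
            ++counts[somePattern];

            munmap(mem, bytes);
            close(fd);
            return 0;
        }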

    To sort them, I would prefer using radix sort.
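
    For example, a least-significant-byte-first radix sort keyed on the count runs in O(n * key_bytes) rather than O(n log n). This in-memory sketch assumes the pattern/count pairs fit in RAM; for larger data you would run it per chunk or over the mmapped file:

        #include <array>
        #include <cstdint>
        #include <utility>
        #include <vector>

        using Entry = std::pair<uint32_t, uint64_t>;   // (pattern, count)

        // Stable LSD radix sort on the 64-bit count, one byte per pass.
        void radixSortByCount(std::vector<Entry>& a) {
            std::vector<Entry> tmp(a.size());
            for (int shift = 0; shift < 64; shift += 8) {
                std::array<size_t, 257> bucket{};              // counting sort on one byte
                for (const Entry& e : a)
                    ++bucket[((e.second >> shift) & 0xFF) + 1];
                for (size_t i = 1; i < bucket.size(); ++i)
                    bucket[i] += bucket[i - 1];                // prefix sums -> start offsets
                for (const Entry& e : a)
                    tmp[bucket[(e.second >> shift) & 0xFF]++] = e;   // stable scatter
                a.swap(tmp);                                   // after 8 passes result is in 'a'
            }
        }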