For my neural-network-training project, I've got a very large file of input data. The file format is binary, and it consists of a large number of fixed-size records. The file is currently ~13GB, but in the future it could become larger; for the purposes of this question, let's assume it will be too large to simply hold in my computer's RAM all at once.
Today's problem involves a little utility program I wrote (in C++, although I don't think the choice of language matters much here, since one would likely hit the same problem in any language). The program reads the big input file and writes out a similar big file -- the output file contains the same data as the input file, except with the records shuffled into a random order.
To do this, I mmap() the input file into memory, generate a list of integers from 1 to N (where N is the number of records in the input file), and randomly shuffle that list. Then I iterate over the shuffled list, writing the corresponding record from the mmap'd memory area out to the output file.
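For reference, here's a minimal sketch of that original approach (the record size, file names, and lack of error checking are just placeholders for illustration, not my actual code):

```cpp
// Naive approach: mmap() the input, shuffle an index list, then write
// records out in the shuffled order.
#include <algorithm>
#include <cstdio>
#include <fcntl.h>
#include <numeric>
#include <random>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

int main(int argc, char ** argv)
{
   const size_t kRecordSize = 2048;  // assumed fixed record size

   const int inFD = open(argv[1], O_RDONLY);
   struct stat st;
   fstat(inFD, &st);
   const size_t numRecords = (size_t) st.st_size / kRecordSize;

   const char * inData = (const char *) mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, inFD, 0);

   std::vector<size_t> order(numRecords);
   std::iota(order.begin(), order.end(), 0);
   std::shuffle(order.begin(), order.end(), std::mt19937_64(std::random_device{}()));

   FILE * outFile = fopen(argv[2], "wb");
   for (size_t idx : order)
   {
      // Each iteration touches an essentially random ~2kB region of the
      // mmap'd file -- this random-access read pattern is the problem.
      fwrite(inData + (idx * kRecordSize), kRecordSize, 1, outFile);
   }

   fclose(outFile);
   munmap((void *) inData, st.st_size);
   close(inFD);
   return 0;
}
```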
This all works correctly, as far as it goes; the problem is that it isn't scaling well. That is, as the input file gets bigger, the time this conversion takes grows faster than O(N), to the point where it has become a bottleneck in my workflow. I suspect the problem is that the I/O system (macOS 10.13.4, using the internal SSD of my Mac Pro trashcan, in case that's important) is optimized for sequential reads, and jumping around to completely random locations in the input file is pretty much the worst case as far as caching/read-ahead/other I/O optimizations are concerned. (I imagine that on a spinning disk it would perform even worse due to head-seek delays, but fortunately I'm at least using an SSD here.)
So my question is: is there any clever alternative strategy or optimization I could use to make this file-randomization process more efficient -- one that would scale better as the size of my input files increases?
Thanks to the advice of various people in this thread (in particular Marc Glisse and Andrew Henle), I was able to reduce the execution time of my program on a 13GB input file from ~16 minutes to ~2 minutes. I'll document how I did it in this answer, since the solution wasn't very much like either of the answers above (it was based more on Marc's comment, so I'll give Marc the checkbox if/when he restates his comment as an answer).
I tried replacing the mmap() strategy with pread(), but that didn't seem to make much difference. I also tried passing F_NOCACHE and various other flags to fcntl(), but they seemed to either have no effect or make things slower, so I decided to try a different approach.
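For what it's worth, here's roughly what that pread() variant looked like, including the F_NOCACHE fcntl() call; again this is a sketch with placeholder names and an assumed record size, not my exact code:

```cpp
// pread()-per-record variant: same random-access pattern as before, just
// without the mmap().
#include <algorithm>
#include <cstdio>
#include <fcntl.h>
#include <numeric>
#include <random>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

int main(int argc, char ** argv)
{
   const size_t kRecordSize = 2048;  // assumed fixed record size

   const int inFD = open(argv[1], O_RDONLY);
   (void) fcntl(inFD, F_NOCACHE, 1);   // macOS: ask the OS not to cache reads (made no difference for me)

   struct stat st;
   fstat(inFD, &st);
   const size_t numRecords = (size_t) st.st_size / kRecordSize;

   std::vector<size_t> order(numRecords);
   std::iota(order.begin(), order.end(), 0);
   std::shuffle(order.begin(), order.end(), std::mt19937_64(std::random_device{}()));

   std::vector<char> record(kRecordSize);
   FILE * outFile = fopen(argv[2], "wb");
   for (size_t idx : order)
   {
      // Still one random ~2kB read per record, so the slowdown is
      // essentially unchanged compared to the mmap() version.
      pread(inFD, record.data(), kRecordSize, (off_t)(idx * kRecordSize));
      fwrite(record.data(), kRecordSize, 1, outFile);
   }
   fclose(outFile);
   close(inFD);
   return 0;
}
```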
The new approach is to do things in a two-layer fashion: rather than reading single records one at a time, my program now loads "blocks" of sequential records from the input file (each block containing around 4MB of data).
The blocks are loaded in random order, and I load in blocks until I have a certain amount of block-data held in RAM (currently ~4GB, as that is what my Mac's RAM can comfortably hold). Then I start grabbing random records out of random in-RAM blocks, and writing them to the output file. When a given block no longer has any records left in it to grab, I free that block and load in another block from the input file. I repeat this until all blocks from the input file have been loaded and all their records distributed to the output file.
This is faster because all of my output is strictly sequential, and my input is mostly sequential as well (i.e. ~4MB of data gets read after each seek, rather than only ~2kB). The ordering of the output is slightly less random than it was, but I don't think that will be a problem for me.
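Here's a simplified sketch of the two-layer approach; the block size, RAM budget, record size, and names are placeholder values for illustration, and error handling is omitted:

```cpp
// Two-layer shuffle: load ~4MB blocks in random order, keep a pool of
// blocks in RAM, and drain random records out of random loaded blocks.
#include <algorithm>
#include <cstdio>
#include <fcntl.h>
#include <numeric>
#include <random>
#include <sys/stat.h>
#include <unistd.h>
#include <utility>
#include <vector>

int main(int argc, char ** argv)
{
   const size_t kRecordSize   = 2048;                                 // assumed fixed record size
   const size_t kBlockBytes   = 4*1024*1024;                          // ~4MB per block
   const size_t kRecsPerBlock = kBlockBytes / kRecordSize;
   const size_t kMaxRamBlocks = (4ULL*1024*1024*1024) / kBlockBytes;  // ~4GB of blocks held in RAM

   std::mt19937_64 rng(std::random_device{}());

   const int inFD = open(argv[1], O_RDONLY);
   struct stat st;
   fstat(inFD, &st);
   const size_t numRecords = (size_t) st.st_size / kRecordSize;
   const size_t numBlocks  = (numRecords + kRecsPerBlock - 1) / kRecsPerBlock;

   // Decide up front what (random) order the blocks will be loaded in
   std::vector<size_t> blockOrder(numBlocks);
   std::iota(blockOrder.begin(), blockOrder.end(), 0);
   std::shuffle(blockOrder.begin(), blockOrder.end(), rng);

   struct LoadedBlock {std::vector<char> data; size_t recordsLeft;};
   std::vector<LoadedBlock> ramBlocks;   // the pool of blocks currently held in RAM
   size_t nextBlockToLoad = 0;

   FILE * outFile = fopen(argv[2], "wb");
   while ((nextBlockToLoad < numBlocks)||(ramBlocks.empty() == false))
   {
      // Keep the in-RAM pool topped up to the memory budget
      while ((nextBlockToLoad < numBlocks)&&(ramBlocks.size() < kMaxRamBlocks))
      {
         const size_t blockIdx    = blockOrder[nextBlockToLoad++];
         const size_t firstRecord = blockIdx * kRecsPerBlock;
         const size_t numRecs     = std::min(kRecsPerBlock, numRecords-firstRecord);

         LoadedBlock b;
         b.data.resize(numRecs*kRecordSize);
         b.recordsLeft = numRecs;
         pread(inFD, b.data.data(), b.data.size(), (off_t)(firstRecord*kRecordSize));  // one seek per ~4MB, not per ~2kB
         ramBlocks.push_back(std::move(b));
      }

      // Pick a random loaded block, then a random not-yet-written record within it
      const size_t bi = std::uniform_int_distribution<size_t>(0, ramBlocks.size()-1)(rng);
      LoadedBlock & b = ramBlocks[bi];
      const size_t ri = std::uniform_int_distribution<size_t>(0, b.recordsLeft-1)(rng);

      fwrite(&b.data[ri*kRecordSize], kRecordSize, 1, outFile);   // output stays strictly sequential

      // Move the last not-yet-written record into the slot we just used, so it won't be picked twice
      b.recordsLeft--;
      if (ri != b.recordsLeft) std::copy_n(&b.data[b.recordsLeft*kRecordSize], kRecordSize, &b.data[ri*kRecordSize]);

      // Once a block is fully drained, free it; a replacement gets loaded on the next pass
      if (b.recordsLeft == 0) {std::swap(ramBlocks[bi], ramBlocks.back()); ramBlocks.pop_back();}
   }
   fclose(outFile);
   close(inFD);
   return 0;
}
```

(In the sketch above, each record slot that gets written out is back-filled with the block's last unused record, which is what keeps the "pick a random remaining record" step cheap.)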