Tags: c++, large-files, fseek

performance of fseeko(FILE* stream, off_t offset, int whence)


MY QUESTION:

I've got a file that's about 4 GB in size. I've never used fseeko/ftello, and I'm not too familiar with how files are organized on a disk. If I open a file and then ask fseeko to jump to, say, the 2,348,973,408th byte of the file, does it have to sequentially traverse thousands upon thousands of block headers (or some such thing) that are chained together like a linked list to get to the middle of my 4 GB file? Or does it have a more efficient way of randomly accessing deep into a file? I'm looking for a way to efficiently jump to a specific byte of a very large file. If this doesn't work efficiently, I've thought about breaking the file into, say, 4000 one-megabyte files, each of which can be fseek'ed more efficiently. Any suggestions?

BACKGROUND:

I've computed a large 6-dimensional array of nearly a half billion double precision numbers which represent a solution to a complex problem. At 8 bytes each, the data set, written as a single file, takes nearly 4 GB of disk space. I want to write a little server application that takes requests for small contiguous ranges of this data, and returns the requested data.

Now I could just read the whole file into RAM, but I intend to leave this server up and running all the time and that eats up half of the 8GB of RAM on my server. So I don't want to do that. Instead, I want to leave it on the disk and read the page of data requested, respond to the request, and then drop the page from RAM again.

My next thought was to load the data into a database, but by the time I store the 6 indices along with each 8-byte data value, and throw an index on the table (for fast lookups), I'm thinking the size of the database would be an order of magnitude bigger than the 4 GB file. That wouldn't be the end of the world, but I may add more of these large files in the future, and I'd rather not have so much data lying around. There may be other options here: I could store an entire page of data in a single row using a binary varchar or some such thing.

But this got me wondering if I couldn't find some way to efficiently access the data straight from a file. I know what bytes of the file I want. The question is whether there's a fast way to get to them. Hence my question about fseeko above.


Solution

  • In principle, fseek() should be fast.

    However, there is one point to watch, namely whether you open the file in text mode or in binary mode:

    • In text mode, the standard guarantees fseek() only for positions previously returned by ftell() (used with SEEK_SET), or for an offset of 0. Support for other parameter combinations is implementation-dependent. Fortunately, an offset of 0 is allowed with any origin, so seeking to the end of the file also works.

    • In binary mode you have no such restriction.

    The purpose of the text mode restriction is to avoid inconsistencies that could arise from direct positioning, because in text mode there is no one-to-one mapping between the bytes read and the bytes on disk (for example, line endings may be translated).
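To make the binary-mode case concrete, here is a minimal sketch of seeking straight to a range of doubles with fseeko(). It assumes a POSIX system and a file that contains nothing but raw 8-byte doubles; the helper name `read_range` and its parameters are hypothetical, not something from the question.

```cpp
#define _FILE_OFFSET_BITS 64  // request a 64-bit off_t on 32-bit POSIX systems
#include <stdio.h>      // fopen, fread, fclose, fseeko (POSIX)
#include <sys/types.h>  // off_t
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reads `count` doubles starting at element `first`.
// Returns an empty vector if the file cannot be opened, the seek fails,
// or fewer than `count` values could be read.
std::vector<double> read_range(const char* path, std::size_t first,
                               std::size_t count) {
    assert(path != nullptr);
    std::vector<double> out;

    FILE* f = fopen(path, "rb");  // "rb": binary mode, arbitrary offsets allowed
    if (!f) return out;

    // Byte offset of the first requested element (assumes raw 8-byte doubles).
    const off_t offset =
        static_cast<off_t>(first) * static_cast<off_t>(sizeof(double));

    if (fseeko(f, offset, SEEK_SET) == 0) {
        out.resize(count);
        const std::size_t got = fread(out.data(), sizeof(double), count, f);
        if (got != count) out.clear();  // short read: treat as failure
    }
    fclose(f);
    return out;
}
```

A long-running server would more likely keep the FILE* open between requests instead of reopening it each time; the open/close pair is here only to keep the sketch self-contained.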

    Edit: additional information about your background

    I assume that you are using a binary file that stores all these numbers in a fixed-size format:

    • in my opinion, a database would not make sense here.

    • if your server runs a 64-bit OS and you have enough disk space for your swap area, you could opt for loading the full dataset into memory: it would be loaded into virtual memory, and the OS would take care of optimizing which memory pages are kept in available RAM.

    • if you browse through your file with a very irregular pattern, the swapping might also trigger a lot of file reads. In that case, using fseek() to go directly to the position calculated from the indexes of your 6 dimensions would be a sensible option.

    • finally, you also have the option to use memory-mapped files, such as POSIX mmap() or Windows MapViewOfFile(). This is very well suited to arrays. Unfortunately, it is not as portable as standard C++.
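The last two options both depend on turning six indexes into one position in the file. The sketch below shows one way to do that and to read the value through a POSIX mmap() mapping. It assumes a row-major layout (last index varies fastest) and a file of raw doubles, which must match how the file was actually written; `linear_index`, `MappedDoubles`, `map_doubles` and `unmap_doubles` are hypothetical names used only for illustration.

```cpp
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close
#include <array>
#include <cassert>
#include <cstddef>

// Row-major element index for a 6-D array (assumed layout: last index fastest).
std::size_t linear_index(const std::array<std::size_t, 6>& dims,
                         const std::array<std::size_t, 6>& idx) {
    std::size_t lin = 0;
    for (std::size_t d = 0; d < 6; ++d) {
        assert(idx[d] < dims[d]);  // each index must be within its dimension
        lin = lin * dims[d] + idx[d];
    }
    return lin;
}

// Read-only view of a file of doubles (hypothetical helper type).
struct MappedDoubles {
    const double* data = nullptr;
    std::size_t count = 0;  // number of whole doubles in the file
    std::size_t bytes = 0;  // exact mapped length, needed by munmap
};

// Maps the whole file read-only; data stays nullptr on any failure.
MappedDoubles map_doubles(const char* path) {
    MappedDoubles m;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return m;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t len = static_cast<std::size_t>(st.st_size);
        void* p = mmap(nullptr, len, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const double*>(p);
            m.count = len / sizeof(double);
            m.bytes = len;
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return m;
}

void unmap_doubles(MappedDoubles& m) {
    if (m.data) munmap(const_cast<double*>(m.data), m.bytes);
    m = MappedDoubles{};
}
```

With the mapping in place, a lookup is just `m.data[linear_index(dims, idx)]`; the OS pages in only the parts of the file that are actually touched. For the fseek() route, the same index multiplied by sizeof(double) gives the byte offset to seek to.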