Tags: c, file, file-writing, data-handling

Efficient Output Format for Huge Data Sets?


I have written a program that writes output to a file. The output is in 6-column, n-row format, and all values are double-precision floats. It is very common in my code for n to become extremely large (1e20 or so) and, hence, the output data file also becomes extremely large.

I am currently storing everything in *.csv format, which obviously results in huge data files. Is there a more efficient way to store such values? Is there a file format or a method that would decrease the file size significantly?

For clarification: The data does not need to be human readable; binary would do just fine. I will further process the data in the file to get some important parameters from the runs, probably the distance traveled, the time of exit at a particular point, etc. The code is an astrophysical simulation of moving particles; with about 1e10 particles and a million time steps each, the output gets quite large.


Solution

  • When designing a file format you have to consider various things, like:

    a) Is there a chance that the file could have been corrupted, or maliciously tampered with (or is there any kind of confidentiality requirement)? The answer to this is almost always "yes". To guard against these things you need to consider some kind of checksum and/or encryption. You may also need to consider whether partial recovery is desirable (e.g. is it beneficial to split the file into multiple blocks/sections where each block has its own checksum/encryption, so that if 4 bytes in one block/section are corrupted you can still recover the majority of the data). (A minimal per-block checksum sketch is shown after this list.)

    b) Is there a portability concern? For example, if you store raw double values in the file, will it create a problem on other computers that have a different binary format for "double"? (A byte-order-independent write sketch is shown after this list.)

    c) For each type of value: what range actually needs to be represented, and what precision is required? Typically software uses "larger and more precise" types than necessary (often because it's faster to select the next largest type the CPU supports); but for file formats this causes an unnecessary increase in file size. For a simple example, maybe you could convert a (64-bit) double into a 32-bit fixed-point format and halve the space used while still achieving the range and precision that's actually needed. (A fixed-point packing sketch is shown after this list.)

    d) Are there "clever" ways to reduce the range and precision required for some of the values? For a simple example, maybe you have a "starting value" and an "ending value" which both need 64 bits; but you could convert them into a "starting value" and a "difference" (so that "ending value" can be calculated as "starting value + difference"), where the "difference" values have less range and only need 32 bits to store. (A delta-encoding sketch is shown after this list.)

    e) Is any kind of indexing beneficial? For a simple example, if the file might contain 1 million entries and you only want to find one, then you might be able to use the index to find the offset of the entry you want and load only that one entry (and avoid loading all 1 million entries). (A seek-to-one-row sketch is shown after this list.)

    f) What other meta-data might you want? This can be things like a "magic signature" (so that software can check whether a file is supposed to comply with the file format and the user didn't give your program the wrong type of file) and a "file format version number" (so that the program can "auto-update to the new file format", or at least detect when the file uses an obsolete/deprecated file format that is no longer supported). It can also include information identifying who the author was, where the data came from, when the data was obtained, which program created/prepared the file, etc. Sometimes there is also optional data, with flags to say whether the optional data is/isn't included in the file. You might also want things like "number of entries" and "offset in file for each different area", etc. (A header layout sketch is shown after this list.)

    g) What kind of allowances do you need to make for extensibility (and backward compatibility, and forward compatibility)? Typically people leave things like (e.g.) "reserved for future use" fields in headers so that they can add/change/extend the file format in future without breaking everything. Sometimes this is even more specific about what software should do when it sees values in reserved fields that it doesn't support - e.g. "reserved for future use, should be zero; if non-zero, software should ignore this value" vs. "reserved for future use, should be zero; if non-zero (due to future use), software should generate an error and not use the file". (The header sketch after this list shows both policies.)

    h) Are any kind of compression techniques useful? For a simple example, if you have "6 columns, N rows" with an index, and sometimes the data for 2 or more rows happens to be the same, then maybe you can store only one copy of the data for those rows and use the index to figure out which row uses which data (a bit like "row[n] = unique_row_data[ index[n] ]"). (A row-deduplication sketch is shown after this list.)
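
For point a), a minimal sketch of per-block integrity checking: each block of rows is followed by its own CRC-32 so a reader can verify blocks independently. The block layout and the 6-double row shape are illustrative assumptions, not part of any standard format.

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Plain bitwise CRC-32 (IEEE polynomial, as used by zip/Ethernet). */
    static uint32_t crc32_buf(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
        }
        return ~crc;
    }

    /* Write one block of 6-column rows followed by its checksum, so a
     * corrupted block can be detected (and skipped) without discarding
     * the rest of the file. */
    static int write_block(FILE *f, const double *rows, size_t nrows)
    {
        size_t nbytes = nrows * 6 * sizeof(double);
        uint32_t crc = crc32_buf(rows, nbytes);
        if (fwrite(rows, 1, nbytes, f) != nbytes) return -1;
        return fwrite(&crc, sizeof crc, 1, f) == 1 ? 0 : -1;
    }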
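
For point b), one way to sidestep byte-order differences is to write the IEEE-754 bit pattern of each double in a fixed byte order. This sketch assumes both writer and reader use IEEE-754 binary64 for "double" (true on essentially all current platforms); it is not a complete serialization layer.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    /* Write a double as 8 little-endian bytes of its bit pattern, so files
     * produced on big- and little-endian machines are byte-for-byte identical. */
    static int write_double_le(FILE *f, double v)
    {
        uint64_t bits;
        unsigned char out[8];
        memcpy(&bits, &v, sizeof bits);
        for (int i = 0; i < 8; i++)
            out[i] = (unsigned char)(bits >> (8 * i));
        return fwrite(out, 1, sizeof out, f) == sizeof out ? 0 : -1;
    }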
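
For point c), a sketch of packing a double into 32-bit fixed point. The scale of 2^16 (range roughly +/-32768, resolution about 1.5e-5) is an assumption chosen for illustration; in practice the scale should come from the range and precision the values actually need.

    #include <stdint.h>

    /* 32-bit signed fixed point with 16 fractional bits: half the size of a
     * double, at the cost of a fixed range and resolution. */
    static int32_t pack_fixed(double v)
    {
        double scaled = v * 65536.0;                          /* 2^16 */
        if (scaled >  2147483647.0) scaled =  2147483647.0;   /* clamp to int32 */
        if (scaled < -2147483648.0) scaled = -2147483648.0;
        return (int32_t)scaled;
    }

    static double unpack_fixed(int32_t raw)
    {
        return (double)raw / 65536.0;
    }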
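
For point d), a sketch of the "starting value + difference" idea: the start keeps full 64-bit precision and the end is reconstructed from a narrower delta. Storing the delta as a 32-bit float is an assumption that only holds if the differences are small enough for float precision to suffice.

    struct interval_rec {
        double start;   /* full 64-bit precision          */
        float  delta;   /* end - start, stored in 32 bits */
    };

    static struct interval_rec make_interval(double start, double end)
    {
        struct interval_rec r = { start, (float)(end - start) };
        return r;
    }

    /* Reader side: end = start + difference. */
    static double interval_end(const struct interval_rec *r)
    {
        return r->start + (double)r->delta;
    }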
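
For point e), with fixed-size rows the "index" can simply be arithmetic: seek straight to row n and read only that row. HEADER_BYTES and the 6-double row layout are illustrative assumptions, and for files larger than 2 GiB the plain fseek below would need fseeko/_fseeki64 or equivalent.

    #include <stdio.h>
    #include <stdint.h>

    #define HEADER_BYTES 64u
    #define ROW_DOUBLES  6u

    /* Load a single row without reading the rest of the file. */
    static int read_row(FILE *f, uint64_t n, double row[ROW_DOUBLES])
    {
        long off = (long)(HEADER_BYTES + n * ROW_DOUBLES * sizeof(double));
        if (fseek(f, off, SEEK_SET) != 0) return -1;
        return fread(row, sizeof(double), ROW_DOUBLES, f) == ROW_DOUBLES ? 0 : -1;
    }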
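
For points f) and g), a sketch of a fixed-size header carrying a magic signature, a version number, counts/offsets and reserved space, plus a reader check that treats unknown reserved bytes either leniently (ignore) or strictly (reject). The field names, sizes and the "PARTSIM" signature are all illustrative assumptions.

    #include <stdint.h>
    #include <string.h>

    struct file_header {
        char     magic[8];      /* e.g. "PARTSIM\0": rejects wrong file types */
        uint32_t version;       /* bump when the layout changes               */
        uint32_t flags;         /* which optional sections are present        */
        uint64_t row_count;     /* number of 6-column rows                    */
        uint64_t data_offset;   /* file offset where the row data starts      */
        uint64_t index_offset;  /* offset of the optional index, 0 = none     */
        uint8_t  reserved[24];  /* must be written as zero                    */
    };

    /* strict = 0: ignore unknown reserved bytes (forward compatible).
     * strict = 1: refuse files that use extensions we don't understand. */
    static int header_ok(const struct file_header *h, int strict)
    {
        if (memcmp(h->magic, "PARTSIM", 8) != 0) return 0;   /* wrong file type */
        if (h->version > 1) return 0;                        /* too new for us  */
        if (strict)
            for (size_t i = 0; i < sizeof h->reserved; i++)
                if (h->reserved[i] != 0) return 0;
        return 1;
    }

A real reader/writer would serialize the header fields individually (as in the byte-order sketch above) rather than fwrite-ing the struct, to avoid padding and endianness surprises.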
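
For point h), a sketch of the row-deduplication idea: identical rows are stored once in a table of unique rows, and the per-row index records which unique row each logical row uses. The linear duplicate search keeps the sketch short; a real implementation would hash rows, and the caller must ensure unique_rows has enough capacity.

    #include <string.h>
    #include <stdint.h>

    /* row[n] = unique_rows[ index[n] ]: a duplicate row costs one 4-byte
     * index entry instead of 48 bytes of data. Returns the index to store. */
    static uint32_t intern_row(const double row[6],
                               double (*unique_rows)[6], uint32_t *nunique)
    {
        for (uint32_t i = 0; i < *nunique; i++)
            if (memcmp(unique_rows[i], row, 6 * sizeof(double)) == 0)
                return i;                     /* seen before: reuse its slot */
        memcpy(unique_rows[*nunique], row, 6 * sizeof(double));
        return (*nunique)++;
    }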