Tags: c, data-structures, binary-files

Storage for numerical data in binary files & data structures


Big topic here, and I am a newbie; I am looking for some direction, since the possibilities seem endless.

I am running a numerical simulation that creates a lot of data, and I want to move away from storing it in plain text (once I tried to save all the data created and ended up with a 4 TB txt file).

My simulation involves 4 fields over an interval (each represented by an array of doubles, typically 4000 to 16000 elements), and they evolve for about 1 million cycles each run, so we are talking about billions of doubles generated.

Of course I do not save everything each time; instead I use 3 types of file (these files are mock-ups for brevity; my actual files are all written in %g format, so each value takes those 7 characters plus tabulations):

  1. File that saves the content of the fields at a specific point for all time steps, e.g.:

    t     Phi    Pi    Delta    A
    0     1.3    0.4   0.3      0.99
    ...
    
  2. File that saves all the fields over the whole interval at a certain time step

    x     Phi   Pi    Delta    A
    0     0.0   0.4   0.0      1.0
    ...
    
  3. File that saves every n steps in time and space

    t    x    Phi    Pi    Delta    A
    0.0  0.0  0.0    1.3   0.0      1.0
    0.0  0.1  0.01   1.2   0.02     0.98
    ...
    0.2  0.0  0.0    1.3   0.0      1.0
    0.2  0.1  0.03   1.5   0.01     0.95
    

I then use these files for various purposes, like plotting graphs, doing Fourier transforms on them, and using them to resume the simulation.

I will eventually need to run this on a cluster, so I am limited to C, and I don't know at the moment if they have any database/big-data system in place.

My questions are:

  1. What is the best format to store this data? I assume it's just saving the doubles as raw binary and then writing a program to retrieve them later, but I am open to suggestions.
  2. What is the best way to organize this data? I was looking around, and maybe I could write a tree in which the leaves are arrays.
  3. What about compression?

Solution

  • What is the best format to store this data?

    It depends on the precision and the structure of the values.

    If 7 significant decimal digits of precision suffice, and the values fit within 2^-126 to 2^127 (1.17549×10^-38 to 1.70141×10^38), then you can use the IEEE-754 binary32 format. On all machines and clusters used for high-performance computing, the float type corresponds to this.

    If you need 15 significant decimal digits of precision, and/or a range from 2^-1023 to 2^1023 (1.11254×10^-308 to 8.98847×10^307), use the IEEE-754 binary64 format. Again, on all machines and clusters used for high-performance computing, the double type corresponds to this.
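
    For example, if binary32 precision is enough for a given field, converting on the fly before writing halves the storage. A minimal sketch (the helper name is made up for illustration):

        #include <stdio.h>

        /* Hypothetical helper: write a double array as IEEE-754 binary32 values,
           accepting the loss of precision beyond ~7 significant decimal digits. */
        static size_t write_as_float(FILE *out, const double *src, size_t count)
        {
            size_t written = 0;
            for (size_t i = 0; i < count; i++) {
                float value = (float)src[i];
                written += fwrite(&value, sizeof value, 1, out);
            }
            return written;  /* number of values successfully written */
        }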

    The remaining problem is byte order and field identification.

    Assuming you do not wish to expend any HPC resources for data conversion during computation, it is best to store the data in native byte order, but include a header in the file that contains a known "prototype" value for each value type, so that a reader can check them to verify if byte order compensation is needed to correctly interpret the fields; plus descriptors for each of the fields.
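
    A minimal sketch of such a header, assuming a hypothetical layout (the struct, magic string, and prototype constants below are illustrative, not a standard format):

        #include <stdint.h>
        #include <stdio.h>

        /* The prototype members are written in native byte order.  A reader
           compares them against the expected constants; if they differ, it
           knows it must byte-swap every value it reads from this file.
           (A real implementation would also pin down struct padding, for
           example by writing the members individually.) */
        struct file_header {
            char     magic[8];      /* e.g. "SIMDATA", identifies the format   */
            uint32_t proto_u32;     /* written as 0x01020304                   */
            double   proto_r64;     /* written as 1.0                          */
            uint64_t field_count;   /* number of field descriptors that follow */
        };

        static int write_header(FILE *out, uint64_t field_count)
        {
            struct file_header header = {
                .magic       = "SIMDATA",
                .proto_u32   = UINT32_C(0x01020304),
                .proto_r64   = 1.0,
                .field_count = field_count,
            };
            return (fwrite(&header, sizeof header, 1, out) == 1) ? 0 : -1;
        }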

    (As an example, I've implemented this in a way that allows the files to be easily read in C and native Fortran 95 with minimal compiler extensions, also allowing each compute node to save its results to a local file, with readers automatically obtaining the data from multiple files in parallel. I typically only support u8, s8, u16, s16, u32, s32, u64, s64 for unsigned and signed integers of various bit sizes, and r32 and r64 for single- and double-precision reals, i.e. binary32 and binary64, respectively. I have not needed complex number formats yet.)

    Most people prefer to use e.g. NetCDF for this, but my approach differs in that writers produce the data in native format rather than a normalized format; my intent is to minimize the overhead at data creation/simulation time and push all the overhead to the readers.

    If you find the small overhead at file generation time (during simulation) to be acceptable, and do not have experience in writing binary file format routines, I do recommend you use NetCDF.
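
    For reference, writing one spatial snapshot of a field with the NetCDF C library looks roughly like this (a sketch, assuming an array phi of length n; checking of the returned status codes is omitted, and the other fields would be defined the same way):

        #include <netcdf.h>

        /* Define one dimension "x" and one double variable "Phi",
           then write the whole array in a single call. */
        int ncid, x_dimid, phi_varid;
        nc_create("snapshot.nc", NC_CLOBBER, &ncid);
        nc_def_dim(ncid, "x", n, &x_dimid);
        nc_def_var(ncid, "Phi", NC_DOUBLE, 1, &x_dimid, &phi_varid);
        nc_enddef(ncid);                        /* leave define mode */
        nc_put_var_double(ncid, phi_varid, phi);
        nc_close(ncid);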

    Do note that if the HPC cluster operators find your simulation/computation wastes resources (for example, the average CPU load per core is low, or it does not scale well to multiple cores), you may not be allowed to run your simulation on a cluster. This depends on local politics and policies too, obviously.

  • What is the best way to organize this data?

    Because of the very large amount of data, parallel files may be your best option. (Some clusters have fast local storage, in which case storing data directly from each node to a local file, and collecting those local files in a bunch after the run, may be preferable. As it varies, ask your cluster admin.)

    In other words, one file per related array of data.
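
    A minimal sketch of that approach, assuming each node appends its own data to a per-field raw binary file in native byte order (the file names below are just examples):

        #include <stdio.h>

        /* Append one time step of a field (count doubles) to a raw binary file. */
        static int append_field(const char *path, const double *data, size_t count)
        {
            FILE *out = fopen(path, "ab");            /* append, binary */
            if (!out)
                return -1;
            size_t written = fwrite(data, sizeof data[0], count, out);
            if (fclose(out) != 0 || written != count)
                return -1;
            return 0;
        }

    so a call like append_field("phi.node03.raw", phi, n) after every saved step produces one steadily growing file per field per node.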

    It is not difficult to write a library that can read from multiple files in parallel, but correctly parsing and managing structured files is much harder.

    Furthermore, splitting the data into separate files often makes data transport easier. If you have a data file 16 TiB in size, you are basically limited to network transport, and may even be limited as to which filesystems you can use. However, if you have say 128 files where each is around 128 GiB in size, you have many more options, and can probably keep some of them in offline storage, while working on others. In particular, many HPC cluster operators will let you transfer the files to local media storage devices (USB3 disks or memory sticks) directly, to reduce network transfer congestion.

  • What about compression?

    You can compress the data if needed, but I personally would do it at the point where the data is collected/combined/processed on your own workstation, not at the point where it is generated. HPC computation is expensive; it is much cheaper to munge the data as you first process it.

    Binary data does not compress as well as text does, but the text files are much larger at the same data resolution. That means it is important to choose the correct value type for each parameter anyway. And you want to keep that type across the entire set, not change it from one record to another, to keep processing simple.

    As to the compression/decompression algorithms, I'd choose between zlib and xz. See e.g. here for a quick look at the speed/compression ratio curves. Simply put, zlib is fast but provides modest compression ratios, whereas xz is slower but provides much better compression ratios.
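
    For instance, compressing an already-written raw dump with zlib on your workstation could look like this (a sketch, assuming you link with -lz; the function name is made up):

        #include <stdio.h>
        #include <zlib.h>

        /* Re-compress a raw binary file into a gzip-compressed copy. */
        static int gzip_file(const char *in_path, const char *out_path)
        {
            FILE  *in  = fopen(in_path, "rb");
            gzFile out = gzopen(out_path, "wb6");     /* "6" = compression level */
            if (!in || !out) {
                if (in)  fclose(in);
                if (out) gzclose(out);
                return -1;
            }

            char   buffer[65536];
            size_t bytes;
            while ((bytes = fread(buffer, 1, sizeof buffer, in)) > 0) {
                if (gzwrite(out, buffer, (unsigned)bytes) <= 0) {
                    fclose(in);
                    gzclose(out);
                    return -1;
                }
            }

            fclose(in);
            return (gzclose(out) == Z_OK) ? 0 : -1;
        }

    For xz there is no equally simple file-style convenience API, so running the xz command-line tool on the finished files is usually the easiest option.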