# HDF5 Storage Overhead

Tags: `scientific-computing`, `hdf5`

I'm writing a large number of small datasets to an HDF5 file, and the resulting filesize is about 10x what I would expect from a naive tabulation of the data I'm putting in. My data is organized hierarchically as follows:

```
group 0
  -> subgroup 0
     -> dataset (dimensions: 100 x 4, datatype: float)
     -> dataset (dimensions: 100, datatype: float)
  -> subgroup 1
     -> dataset (dimensions: 100 x 4, datatype: float)
     -> dataset (dimensions: 100, datatype: float)
  ...
group 1
...
```

Each subgroup should take up 500 floats * 4 bytes = 2,000 bytes, ignoring overhead. I don't store any attributes alongside the data. Yet in testing, I find that each subgroup takes up about 4 kB, roughly twice what I would expect. I understand that there is some overhead, but where is it coming from, and how can I reduce it? Is it in representing the group structure?
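For reference, here is a minimal sketch (using h5py; group/dataset names and counts are illustrative, not from my actual code) that reproduces the layout above and measures the per-subgroup overhead by comparing the raw data size against the resulting file size:

```python
import os
import tempfile

import h5py
import numpy as np

# Recreate the hierarchical layout: groups containing subgroups, each
# subgroup holding a (100 x 4) and a (100,) float32 dataset.
path = os.path.join(tempfile.mkdtemp(), "nested.h5")
n_groups, n_subgroups = 10, 10

with h5py.File(path, "w") as f:
    for g in range(n_groups):
        grp = f.create_group(f"group{g}")
        for s in range(n_subgroups):
            sub = grp.create_group(f"subgroup{s}")
            sub.create_dataset("a", data=np.zeros((100, 4), dtype="f4"))
            sub.create_dataset("b", data=np.zeros(100, dtype="f4"))

# 500 float32 values per subgroup = 2,000 bytes of raw data.
raw_bytes = n_groups * n_subgroups * 500 * 4
file_bytes = os.path.getsize(path)
per_subgroup = (file_bytes - raw_bytes) / (n_groups * n_subgroups)
print(f"raw data: {raw_bytes} B, file: {file_bytes} B, "
      f"overhead per subgroup: {per_subgroup:.0f} B")
```

The printed overhead per subgroup is what the numbers below refer to; it comes from the group metadata (object headers, B-trees, and local heaps) that HDF5 stores for every group and dataset.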

More information: If I increase the dimensions of the two datasets in each subgroup to 1000 x 4 and 1000, then each subgroup takes up about 22,250 bytes, rather than the flat 20,000 bytes I expect. This implies an overhead of about 2.2 kB per subgroup, which is consistent with the results I was getting with the smaller dataset sizes. Is there any way to reduce this overhead?

## Solution

I'll answer my own question. The overhead involved just in representing the group structure is enough that it doesn't make sense to store small arrays, or to have many groups, each containing only a small amount of data. There does not seem to be any way to reduce the overhead per group, which I measured at about 2.2 kB.

I resolved this issue by combining the two datasets in each subgroup into a single (100 x 5) dataset. Then I eliminated the subgroups entirely and combined all of the datasets in each group into one 3D dataset. Thus, where I previously had N subgroups, I now have a single dataset per group with shape (N x 100 x 5), saving the N * 2.2 kB of overhead that was previously present. Moreover, since HDF5's built-in compression operates on chunks and is far more effective on larger arrays, I now get a better-than-1:1 overall packing ratio, whereas before, overhead took up half the space of the file and compression was completely ineffective.
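A sketch of the consolidated layout (again using h5py, with illustrative names and counts): one chunked, gzip-compressed 3D dataset per group replaces the N subgroups, so the group-metadata overhead is paid once per group instead of once per subgroup:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "flat.h5")
n_groups, n_sub = 10, 10

with h5py.File(path, "w") as f:
    for g in range(n_groups):
        grp = f.create_group(f"group{g}")
        # One (N x 100 x 5) dataset replaces the N subgroups, each of
        # which previously held a (100 x 4) and a (100,) dataset.
        data = np.zeros((n_sub, 100, 5), dtype="f4")
        # Requesting compression implies chunked storage in h5py.
        grp.create_dataset("data", data=data, compression="gzip")

raw_bytes = n_groups * n_sub * 500 * 4
print(f"raw data: {raw_bytes} B, file: {os.path.getsize(path)} B")
```

With realistic (less compressible) data the savings are smaller than with the zero-filled arrays used here, but the per-subgroup metadata overhead is eliminated either way.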

The lesson is to avoid complicated group structures in HDF5 files, and to try to combine as much data as possible into each dataset.
