Tags: python, hdf5, pytables, bigdata

Out of memory when saving a large array with HDF5 (Python, PyTables)


Hi folks,

I've got a Python process which generates matrices. These are stacked on top of each other and saved as a tensor. Here is the code:

import numpy as np
import tables

h5file = tables.open_file("data/tensor.h5", mode="w", title="tensor")
atom = tables.Atom.from_dtype(np.dtype('int16'))
tensor_shape = (N, 3, MAT_SIZE, MAT_SIZE)
tensor = h5file.create_carray(h5file.root, 'tensor', atom, tensor_shape)

for i in range(N):
    mat = generate(i)          # one (3, MAT_SIZE, MAT_SIZE) int16 block
    tensor[i, :, :, :] = mat

The problem is that when it hits 8 GB, it goes out of memory. Shouldn't the HDF5 format never run out of memory? That is, shouldn't it move data from memory to disk when required?


Solution

  • When you are using PyTables, the HDF5 file is kept in memory until the file is closed (see more here: In-memory HDF5 files).

    I would recommend having a look at the append and flush methods of PyTables, as I think that's exactly what you want (see the sketch after this list). Be aware that flushing the buffer on every loop iteration will significantly reduce the performance of your code, due to the constant I/O that needs to be performed.

    Also, writing the file in chunks (just like when reading data into DataFrames in pandas) might pique your interest - see more here: PyTables optimization.
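
    A minimal sketch of both points (appending rows and flushing in batches), assuming the MAT_SIZE and generate() names from the question; the stand-in generate(), the small N and MAT_SIZE values, and the flush interval of 100 are illustrative assumptions, not part of the original code:

    import numpy as np
    import tables

    N = 1_000         # number of matrix stacks (assumed, kept small here)
    MAT_SIZE = 128    # matrix side length (assumed)

    def generate(i):
        # Stand-in for the question's generate(): one (3, MAT_SIZE, MAT_SIZE) int16 block.
        return np.full((3, MAT_SIZE, MAT_SIZE), i % 32767, dtype=np.int16)

    with tables.open_file("data/tensor.h5", mode="w", title="tensor") as h5file:
        # EArray: the first dimension is extendable, so rows can be appended as they are produced.
        tensor = h5file.create_earray(
            h5file.root, "tensor",
            atom=tables.Int16Atom(),
            shape=(0, 3, MAT_SIZE, MAT_SIZE),
            expectedrows=N,
        )
        for i in range(N):
            tensor.append(generate(i)[np.newaxis, ...])  # append one row at a time
            if (i + 1) % 100 == 0:
                h5file.flush()  # push buffered rows to disk every 100 iterations

    Appending row by row and flushing in batches keeps only the current buffer in RAM, and passing expectedrows lets PyTables pick a sensible chunk size for the on-disk dataset.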