Tags: python, numpy, hdf5, h5py

Python - HDF5 to numpy array - out of memory


I have an HDF5 file that contains 20,000,000 rows; each row has 8 float32 columns. The total raw memory size should be approximately 640 MB.

I want to load this data into my Python app; however, while loading it into a numpy array I run out of memory (I have 64 GB of RAM).

I use this code:

import h5py

hf = h5py.File(dataFileName, 'r')
data = hf['data'][:]

For smaller files this works fine, and my input is not that big either. Is there another way to load the entire dataset into memory? It should fit without any problems. And why does it take so much memory in the first place? Even if the data were internally converted from float32 to float64, it would be nowhere near the size of my entire RAM.

Dataset info from HDFView 3.3.0



Solution

  • Your first sign of trouble could have been that the file, while holding some 500 MiB of data, actually has a size of about 850 MiB; at least when I replicate it on my system. This indicates an excessive amount of overhead.
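
    As a quick sanity check (a minimal sketch; it assumes the dataset is named "data" and the dataFileName variable from the question), you can compare the raw data size with the file size and count the chunks yourself:

    import math
    import os

    import h5py

    with h5py.File(dataFileName, "r") as hf:
        dset = hf["data"]
        raw_bytes = dset.size * dset.dtype.itemsize
        # dset.chunks is None for contiguous (unchunked) datasets
        n_chunks = math.prod(
            math.ceil(s / c) for s, c in zip(dset.shape, dset.chunks)
        )
        print(f"raw data:   {raw_bytes / 2**20:.0f} MiB")
        print(f"file size:  {os.path.getsize(dataFileName) / 2**20:.0f} MiB")
        print(f"chunk shape: {dset.chunks} -> {n_chunks} chunks")

    If the file is much larger than the raw data and the chunk count runs into the millions, the chunk layout is the problem.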

    The tiny chunk size combined with the rather large data set size apparently breaks the HDF5 library or at least gets it to allocate an enormous amount of memory. As a test, this will consume all memory and swap on my system if I don't kill the process fast enough:

    import h5py
    import numpy as np

    data = np.random.random((16409916, 8)).astype('f4')  # ~500 MiB of float32
    with h5py.File(outpath, "w") as outfile:
        dset = outfile.create_dataset("data", data=data, chunks=(2, 8))
    

    Meanwhile this will work but is very slow:

    with h5py.File(outpath, "w") as outfile:
        dset = outfile.create_dataset(
                "data", shape=data.shape, dtype=data.dtype, chunks=(2, 8))
        # write two rows at a time; each assignment fills exactly one chunk
        for start in range(0, len(data), 2):
            end = start + 2
            dset[start:end] = data[start:end]
    

    Likewise, you cannot read it all at once with such a ludicrous chunk size. If I had to guess why, the library probably wants to figure out all chunk locations before reading them; with only two rows per chunk, that means bookkeeping for roughly eight million chunks. This turns a rather compact on-disk representation into a large one in memory.

    Try something like this as a workaround:

    with h5py.File(inpath, "r") as infile:
        dset = infile["data"]
        shape, dtype = dset.shape, dset.dtype
        data = np.empty(shape, dtype)
        # read in slices of roughly 1 MiB so each read only touches a
        # limited number of chunks
        raw_chunksize = 1024**2  # 1 MiB
        raw_rowsize = dtype.itemsize * np.prod(shape[1:])
        chunksize = max(1, raw_chunksize // raw_rowsize)
        for start in range(0, len(data), chunksize):
            end = start + chunksize
            data[start:end] = dset[start:end]
    

    Please tell whoever created those files to read up on the meaning of chunk size and to choose one appropriately, typically in the range of 64 KiB to 1 MiB.
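
    For reference, here is a minimal sketch of writing the same data with a sensible chunk size (32768 rows × 8 float32 columns = 1 MiB per chunk); the output file name is only a placeholder:

    with h5py.File("data_rechunked.h5", "w") as outfile:
        # 32768 rows * 8 columns * 4 bytes = exactly 1 MiB per chunk
        outfile.create_dataset("data", data=data, chunks=(32768, 8))
        # alternatively, chunks=True lets h5py pick an auto-chunk shape

    With chunks of that size, the plain data = hf['data'][:] from the question should read the whole dataset in one go without any memory blow-up.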