Tags: python, numpy, memory, hdf5, h5py

Memory error while reading a large .h5 file


I have created a .h5 file from a NumPy array:

h5f = h5py.File('/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5', 'w')
h5f.create_dataset('JZ3WPpxpypz', data=all, compression="gzip")

HDF5 dataset "JZ3WPpxpypz": shape (19494500, 376), type "f8"

But I am getting a memory error while reading the .h5 file back into a NumPy array:

filename = '/data/debo/jetAnomaly/AtlasData/dijets/mergedRoot/miniTrees/JZ3W.h5'
h5 = h5py.File(filename,'r')

h5.keys()

[u'JZ3WPpxpypz']

data = h5['JZ3WPpxpypz']

If I try to view the array, it gives me a memory error:

data[:]

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-33-629f56f97409> in <module>()
----> 1 data[:]

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

h5py/_objects.pyx in h5py._objects.with_phil.wrapper()

/home/debo/env_autoencoder/local/lib/python2.7/site-packages/h5py/_hl/dataset.pyc in __getitem__(self, args)
    560         single_element = selection.mshape == ()
    561         mshape = (1,) if single_element else selection.mshape
--> 562         arr = numpy.ndarray(mshape, new_dtype, order='C')
    563 
    564         # HDF5 has a bug where if the memory shape has a different rank

MemoryError: 

Is there any memory efficient way to read .h5 file into numpy array?

Thanks, Debo.


Solution

  • You don't need to call numpy.ndarray() to get an array. Try this:

    arr = h5['JZ3WPpxpypz'][()]
    # or
    arr = data[()]
    

    Adding [()] reads the entire dataset into a NumPy array (unlike your data variable, which simply references the HDF5 dataset on disk). Either method gives you an array with the same dtype and shape as the original. You can also use NumPy slicing notation to read subsets of the array.

    A clarification is in order: numpy.ndarray() is called internally by h5py whenever it converts a dataset to an array -- that is the call you see in your traceback. Here are type checks to show the difference in the returns from the 2 calls:

    # check type for each variable:
    data = h5['JZ3WPpxpypz']
    print (type(data))
    # versus
    arr = data[()]
    print (type(arr))
    

    Output will look like this:

    <class 'h5py._hl.dataset.Dataset'>
    <class 'numpy.ndarray'>
    

    In general, h5py dataset behavior is similar to numpy arrays (by design). However, they are not the same. When you tried to print the dataset contents with data[:], h5py tried to convert the entire dataset to a numpy array in the background with numpy.ndarray(). It would have worked if you had a smaller dataset or sufficient memory.

    My takeaway: arr = h5['JZ3WPpxpypz'][()] is the idiomatic way to read a dataset into a numpy array, but it still allocates the full array in memory, just like data[:]. For your (19494500, 376) float64 dataset that is roughly 58 GB, so either call can fail with the same MemoryError.

    When you have very large datasets, you may run into situations where you can't create an array with arr = h5['dataset'][()] because the dataset is too large to fit into memory as a numpy array. When this occurs, you can create the h5py dataset object, then access subsets of the data with slicing notation, like this trivial example:

    data = h5['JZ3WPpxpypz']
    arr1 = data[0:100000]
    arr2 = data[100000:200000]
    # etc.
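
    If the goal is to compute something over the whole dataset rather than hold it all at once, the slicing above extends naturally to a loop that reads one block of rows at a time, so peak memory is one block instead of 58 GB. A minimal sketch, assuming a small stand-in file -- the path, dataset name, array contents, and block size below are illustrative, not your real data:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical small file standing in for the real one, so the sketch is runnable.
path = os.path.join(tempfile.mkdtemp(), 'demo.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('JZ3WPpxpypz', data=np.ones((250, 4), dtype='f8'))

block = 100  # rows per read; tune this to your memory budget
total = 0.0
with h5py.File(path, 'r') as f:
    data = f['JZ3WPpxpypz']                # lazy h5py Dataset; nothing read yet
    for start in range(0, data.shape[0], block):
        chunk = data[start:start + block]  # ndarray of at most `block` rows
        total += chunk.sum()               # process the block, then discard it

print(total)  # prints 1000.0 (sum of 250 x 4 ones)
```

    h5py only reads the requested slice from disk on each iteration, so this pattern works even when the full dataset is far larger than RAM.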