Tags: python, arrays, numpy, machine-learning, hdf5

Why can I process a large file only when I don't fix HDF5 deprecation warning?


After receiving the H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead. warning, I changed my code to:

import h5py
import numpy as np 

f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
bar = f['bar']
N, C, H, W = foo.shape  # (8192, 3, 1080, 1920)
data_foo = np.array(foo[()]) # [()] equivalent to .value

and when I tried to read a (not so) big file of images, I got Killed: 9 in my terminal: the process was killed on the last line of the code because it was consuming too much memory, despite that archaic comment of mine there.

However, my original code:

f = h5py.File('myfile.hdf5', mode='r')
data_foo = f.get('foo').value
# script's logic after that worked, process not killed

worked just fine, except for the issued warning.

Why did my code work?


Solution

Let me explain what your code is doing and why you are getting memory errors. First, some HDF5/h5py basics. (The h5py docs are an excellent starting point. Check here: h5py QuickStart)

foo = f['foo'] and foo = f.get('foo') both return an h5py dataset object named 'foo'. (Note: it's more common to see this as foo = f['foo'], but there is nothing wrong with the get() method.) A dataset object is not the same as a NumPy array. Datasets behave like NumPy arrays: both have a shape and a data type, and both support array-style slicing. However, accessing a dataset object does not read all of its data into memory; data is read from disk only when you slice it. As a result, datasets require far less memory to access, as the sketch below shows. This is important when working with large datasets!
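
    To see that laziness in action, here is a minimal sketch, assuming the 'myfile.hdf5'/'foo' layout from the question: opening the dataset touches only metadata, and a single-index slice reads just one image from disk.

    import h5py

    # Minimal sketch; assumes 'myfile.hdf5' with a dataset named 'foo',
    # as in the question.
    with h5py.File('myfile.hdf5', mode='r') as f:
        ds = f['foo']               # h5py Dataset object; no image data read yet
        print(ds.shape, ds.dtype)   # metadata only, no bulk I/O
        first = ds[0]               # reads just one (3, 1080, 1920) image
        print(type(first))          # <class 'numpy.ndarray'>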

This statement returns a NumPy array: data_foo = f.get('foo').value. The preferred method is data_foo = f['foo'][:]. (NumPy slicing notation is the way to return a NumPy array from a dataset object. As you discovered, .value is deprecated.)
    This also returns a NumPy array: data_foo = foo[()] (assuming foo is defined as above).
So, when you execute the statement data_foo = np.array(foo[()]), you are creating a new NumPy array from another array (foo[()] is the input object): foo[()] first reads the entire dataset into memory as one array, and np.array() then allocates a second copy of it. I suspect your process was killed because the memory needed to hold a (8192, 3, 1080, 1920) array plus a copy of it exceeded your system resources. That likely also explains why your original code ran: .value created only one in-memory copy. The np.array() statement will work for small datasets/arrays, but it's not good practice.
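
    To put a number on that, here is a back-of-the-envelope sketch of the dataset's in-memory footprint. (The dtype isn't given in the question, so uint8 is an assumption; float32 would be four times larger.)

    import numpy as np

    # Hypothetical footprint estimate; uint8 is assumed, not stated in the question.
    n_bytes = 8192 * 3 * 1080 * 1920 * np.dtype('uint8').itemsize
    print(n_bytes / 2**30)  # ~47.5 GiB for uint8 (~190 GiB for float32),
                            # and np.array(foo[()]) briefly needs a second copy on top.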

    Here's an example to show how to use the different methods (h5py dataset object vs NumPy array).

    h5f = h5py.File('myfile.hdf5', mode='r')
    
# This returns an h5py dataset object:
    foo_ds = h5f['foo']
    # You can slice it to get elements like this:
    foo_slice1 = foo_ds[0,:,:,:]  # first image (index 0 along the N axis)
    foo_slice2 = foo_ds[-1,:,:,:] # last image
    
# This is the recommended method to get a NumPy array of the entire dataset:
    foo_arr = h5f['foo'][:]
    # or, referencing the h5py dataset object above:
    foo_arr = foo_ds[:]
    # You can also create an array from a slice:
    foo_slice1 = h5f['foo'][0,:,:,:]
    # which is the same as (from above):
    foo_slice1 = foo_ds[0,:,:,:]
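
    Finally, if the entire dataset doesn't fit in memory, you can process it in slices so that only one batch is resident at a time. Here is a sketch of that pattern, where process() is a hypothetical stand-in for your real per-image logic:

    import h5py

    def process(img):
        # hypothetical placeholder for the real per-image work
        return img.mean()

    with h5py.File('myfile.hdf5', mode='r') as h5f:
        foo_ds = h5f['foo']                      # dataset object; data stays on disk
        batch = 256                              # tune to available RAM
        results = []
        for start in range(0, foo_ds.shape[0], batch):
            chunk = foo_ds[start:start + batch]  # reads only this slice into memory
            results.extend(process(img) for img in chunk)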