Tags: python, arrays, numpy, machine-learning, hdf5

Why can I process a large file only when I don't fix HDF5 deprecation warning?


After receiving the H5pyDeprecationWarning: dataset.value has been deprecated. Use dataset[()] instead. warning, I changed my code to:

import h5py
import numpy as np 

f = h5py.File('myfile.hdf5', mode='r')
foo = f['foo']
bar = f['bar']
N, C, H, W = foo.shape  # (8192, 3, 1080, 1920)
data_foo = np.array(foo[()]) # [()] equivalent to .value

and when I tried to read a (not so) big file of images, I got Killed: 9 in my terminal: the process was killed on the last line of the code because it was consuming too much memory, despite that archaic comment of mine there.

However, my original code:

f = h5py.File('myfile.hdf5', mode='r')
data_foo = f.get('foo').value
# script's logic after that worked, process not killed

worked just fine, except for the issued warning.

Why did my code work?


Solution

Let me explain what your code is doing and why you are getting memory errors. First, some HDF5/h5py basics. (The h5py docs are an excellent starting point. Check here: h5py QuickStart)

foo = f['foo'] and foo = f.get('foo') both return an h5py dataset object named 'foo'. (Note: it's more common to see this as foo = f['foo'], but there is nothing wrong with the get() method.) A dataset object is not the same as a NumPy array. Datasets behave like NumPy arrays: both have a shape and a data type, and both support array-style slicing. However, accessing a dataset object does not read all of its data into memory; data is read from disk only when you slice it. As a result, datasets require far less memory to access, as the sketch below shows. This is important when working with large datasets!
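
    To see that laziness in action, here is a minimal sketch, assuming the 'myfile.hdf5'/'foo' layout from the question: opening the dataset touches only metadata, and a single-index slice reads just one image from disk.

    import h5py

    # Minimal sketch; assumes 'myfile.hdf5' with a dataset named 'foo',
    # as in the question.
    with h5py.File('myfile.hdf5', mode='r') as f:
        ds = f['foo']               # h5py Dataset object; no image data read yet
        print(ds.shape, ds.dtype)   # metadata only, no bulk I/O
        first = ds[0]               # reads just one (3, 1080, 1920) image
        print(type(first))          # <class 'numpy.ndarray'>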

This statement returns a NumPy array: data_foo = f.get('foo').value. The preferred method is data_foo = f['foo'][:]. (NumPy slicing notation is the way to return a NumPy array from a dataset object. As you discovered, .value is deprecated.)
    This also returns a NumPy array: data_foo = foo[()] (assuming foo is defined as above).
So, when you execute the statement data_foo = np.array(foo[()]), you are creating a new NumPy array from another array (foo[()] is the input object): foo[()] first reads the entire dataset into memory as one array, and np.array() then allocates a second copy of it. I suspect your process was killed because the memory needed to hold a (8192, 3, 1080, 1920) array plus a copy of it exceeded your system resources. That likely also explains why your original code ran: .value created only one in-memory copy. The np.array() statement will work for small datasets/arrays, but it's not good practice.
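
    To put a number on that, here is a back-of-the-envelope sketch of the dataset's in-memory footprint. (The dtype isn't given in the question, so uint8 is an assumption; float32 would be four times larger.)

    import numpy as np

    # Hypothetical footprint estimate; uint8 is assumed, not stated in the question.
    n_bytes = 8192 * 3 * 1080 * 1920 * np.dtype('uint8').itemsize
    print(n_bytes / 2**30)  # ~47.5 GiB for uint8 (~190 GiB for float32),
                            # and np.array(foo[()]) briefly needs a second copy on top.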

    Here's an example to show how to use the different methods (h5py dataset object vs NumPy array).

    h5f = h5py.File('myfile.hdf5', mode='r')
    
# This returns an h5py dataset object:
    foo_ds = h5f['foo']
    # You can slice it to get elements like this:
    foo_slice1 = foo_ds[0,:,:,:]  # first image (index 0 along the N axis)
    foo_slice2 = foo_ds[-1,:,:,:] # last image
    
# This is the recommended method to get a NumPy array of the entire dataset:
    foo_arr = h5f['foo'][:]
    # or, referencing the h5py dataset object above:
    foo_arr = foo_ds[:]
    # You can also create an array from a slice:
    foo_slice1 = h5f['foo'][0,:,:,:]
    # which is the same as (from above):
    foo_slice1 = foo_ds[0,:,:,:]
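
    Finally, if the entire dataset doesn't fit in memory, you can process it in slices so that only one batch is resident at a time. Here is a sketch of that pattern, where process() is a hypothetical stand-in for your real per-image logic:

    import h5py

    def process(img):
        # hypothetical placeholder for the real per-image work
        return img.mean()

    with h5py.File('myfile.hdf5', mode='r') as h5f:
        foo_ds = h5f['foo']                      # dataset object; data stays on disk
        batch = 256                              # tune to available RAM
        results = []
        for start in range(0, foo_ds.shape[0], batch):
            chunk = foo_ds[start:start + batch]  # reads only this slice into memory
            results.extend(process(img) for img in chunk)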