I have data in the following shape: (127260, 2, 1250)
The type of this data is <HDF5 dataset "data": shape (127260, 2, 1250), type "<f8">
The first dimension (127260) is the number of signals, the second dimension (2) is the type of signal, and the third dimension (1250) is the number of points in each signal.
What I want to do is cut each signal in half, so that every signal has 625 points and I have twice as many signals.
How do I convert the HDF5 dataset to something like a NumPy array, and how do I do this reshape?
If I understand correctly, you want a new dataset with shape (2*127260, 2, 625). If so, it's fairly simple to read 2 slices of the dataset into 2 NumPy arrays, create a new array from the slices, then write it to a new dataset. Note: reading slices is simple and fast, so I would leave the data as-is and slice on the fly unless you have a compelling reason to create a new dataset.
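For example, here is a minimal sketch of the on-the-fly approach (the file name 'data.h5' is an assumption; use your own file and dataset names):
import h5py

# slicing an h5py dataset returns a NumPy array holding only the requested points
with h5py.File('data.h5', 'r') as h5f:  # 'data.h5' is a placeholder name
    first_half = h5f['dataset_name'][0, :, :625]   # points 0-624 of signal 0
    second_half = h5f['dataset_name'][0, :, 625:]  # points 625-1249 of signal 0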
Code to create the new dataset (where h5f is the h5py file object):
import numpy as np

# allocate an array large enough to hold both halves of every signal
new_arr = np.empty((2*127260, 2, 625))
# slicing an h5py dataset returns a NumPy array
arr1 = h5f['dataset_name'][:, :, :625]  # first 625 points of each signal
arr2 = h5f['dataset_name'][:, :, 625:]  # last 625 points of each signal
new_arr[:127260, :, :] = arr1
new_arr[127260:, :, :] = arr2
h5f.create_dataset('new_dataset_name', data=new_arr)
Alternatively, you can combine the read and copy into a single step:
new_arr = np.empty((2*127260, 2, 625))
# read each half of the source directly into its half of the new array
new_arr[:127260, :, :] = h5f['dataset_name'][:, :, :625]
new_arr[127260:, :, :] = h5f['dataset_name'][:, :, 625:]
h5f.create_dataset('new_dataset_name', data=new_arr)
Here is a third method. It is the most direct way, and it reduces the memory overhead, which is important when you have very large datasets that won't fit in memory.
# create an empty dataset, then write each half of the source directly into it
h5f.create_dataset('new_dataset_name', shape=(2*127260, 2, 625), dtype=float)
h5f['new_dataset_name'][:127260, :, :] = h5f['dataset_name'][:, :, :625]
h5f['new_dataset_name'][127260:, :, :] = h5f['dataset_name'][:, :, 625:]
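If even a single half-slice is too large for memory, the same idea works block by block along the first axis. A rough sketch (the block size of 10000 is an arbitrary assumption; tune it to your memory budget):
n = 127260
block = 10000  # arbitrary block size; adjust to your available memory
for start in range(0, n, block):
    stop = min(start + block, n)
    # copy one block of first halves, then the matching block of second halves
    h5f['new_dataset_name'][start:stop, :, :] = h5f['dataset_name'][start:stop, :, :625]
    h5f['new_dataset_name'][n+start:n+stop, :, :] = h5f['dataset_name'][start:stop, :, 625:]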
Whichever method you choose, I suggest adding an attribute to note the data source for future reference:
h5f['new_dataset_name'].attrs['Data Source'] = 'data sliced from dataset_name'
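As a quick sanity check, you can read the shape and the attribute back (assuming the names used above):
print(h5f['new_dataset_name'].shape)                 # (254520, 2, 625)
print(h5f['new_dataset_name'].attrs['Data Source'])  # data sliced from dataset_name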