
How to merge different MATLAB .mat files holding metadata for use in Python?


I've got 1,000+ very long MATLAB vectors (varying lengths, on the order of 10^8 samples) representing data from different patients and sources. I want to organize them compactly in a single file for convenient later access in Python, and I want each sample to somehow carry additional information (patient ID, sampling frequency, etc.).

The desired structure is:

Hospital 1:
   Pat. 1:
      vector:sample 1
      vector:sample 2

   Pat. 2:
      vector:sample 1
      vector:sample 2


Hospital 2:
   Pat. 1:
      vector:sample 1
      vector:sample 2
    .
    .
    .

I thought about converting the samples to HDF5, adding the metadata, and then merging the several HDF5 files into a single file, but I'm running into difficulties.
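A note that may simplify the conversion step: .mat files saved with MATLAB's `-v7.3` flag are already HDF5 files, so `h5py` can open them directly, with no intermediate conversion (older .mat versions are usually read with `scipy.io.loadmat`). A minimal self-contained sketch — the filename `patient_001.mat` and variable name `ecg` are invented for illustration, and the file is created here so the snippet runs on its own:

```python
import h5py
import numpy as np

# Stand-in for a MATLAB '-v7.3' save: such .mat files are plain HDF5,
# with each workspace variable stored as a dataset (MATLAB writes arrays
# transposed, so a row vector appears as shape (1, N)).
with h5py.File('patient_001.mat', 'w') as mat:
    mat['ecg'] = np.random.rand(1, 1000)

# Read the variable back with h5py, exactly as for any HDF5 file
with h5py.File('patient_001.mat', 'r') as mat:
    vec = np.array(mat['ecg']).squeeze()  # drop the singleton dimension

print(vec.shape)  # (1000,)
```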

already tried:

Open for suggestions!


Solution

  • I see at least two approaches with HDF5. You can copy all of your data into a single file; gigabytes of data are not a problem for HDF5 (given sufficient resources). Alternatively, you can save each Patient's data in a separate file and use External Links to point to it from a central HDF5 file. After you create the links, you can access the data as if it lived in the central file. Both methods are shown below with small, simple "samples" generated with NumPy random. Each sample is a single dataset and carries attributes with the Hospital, Patient, and Sample IDs.

    Method 1: All data in a single file

    import h5py
    import numpy as np
    
    num_h = 3   # hospitals
    num_p = 5   # patients per hospital
    num_s = 2   # samples per patient
    
    with h5py.File('SO_59556149.h5', 'w') as h5f:
    
        for h_cnt in range(num_h):
            for p_cnt in range(num_p):
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000,1)
                    dset = h5f.create_dataset(ds_name, data=vec_arr)
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID']=h_cnt
                    dset.attrs['Patient ID']=p_cnt
                    dset.attrs['Sample ID']=s_cnt
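
    Once written, the file can be walked and the metadata recovered from the attributes. A small sketch of reading such a file back — it builds a one-dataset demo file first so it runs standalone, and the filename is invented:

```python
import h5py
import numpy as np

# Build a tiny file in the same layout as Method 1
with h5py.File('SO_59556149_demo.h5', 'w') as h5f:
    dset = h5f.create_dataset('H_0_P_0_S_0', data=np.random.rand(1000, 1))
    dset.attrs['Hospital ID'] = 0
    dset.attrs['Patient ID'] = 0
    dset.attrs['Sample ID'] = 0

# Walk every dataset and collect its attributes into a plain dict
with h5py.File('SO_59556149_demo.h5', 'r') as h5f:
    meta = {name: dict(ds.attrs) for name, ds in h5f.items()}

print(meta)
```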
    

    Method 2: External links to Patient data in separate files

    import h5py
    import numpy as np
    
    num_h = 3   # hospitals
    num_p = 5   # patients per hospital
    num_s = 2   # samples per patient
    
    with h5py.File('SO_59556149_link.h5', 'w') as h5f:
    
        for h_cnt in range(num_h):
            for p_cnt in range(num_p):
                fname = 'SO_59556149_' + 'H_' + str(h_cnt) + '_P_' + str(p_cnt) + '.h5'
                h5f2 = h5py.File(fname, 'w')
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000,1)
                    dset = h5f2.create_dataset(ds_name, data=vec_arr)
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID']=h_cnt
                    dset.attrs['Patient ID']=p_cnt
                    dset.attrs['Sample ID']=s_cnt
                    h5f[ds_name] = h5py.ExternalLink(fname, ds_name)
                h5f2.close()
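
    To confirm the links behave as described, a short sketch (filenames invented) that creates one Patient file, links to it from a central file, and reads the data back through the link:

```python
import h5py
import numpy as np

# A small Patient file holding one sample dataset
with h5py.File('patient_file.h5', 'w') as pf:
    pf.create_dataset('H_0_P_0_S_0', data=np.random.rand(1000, 1))

# A central file that only holds an External Link to that dataset
with h5py.File('central_file.h5', 'w') as cf:
    cf['H_0_P_0_S_0'] = h5py.ExternalLink('patient_file.h5', 'H_0_P_0_S_0')

# Reading through the link looks identical to reading a local dataset.
# Note: the linked file must remain reachable at the stored path
# (relative here), or the link will fail to resolve.
with h5py.File('central_file.h5', 'r') as cf:
    shape = cf['H_0_P_0_S_0'].shape

print(shape)  # (1000, 1)
```

    One practical consequence: the central file stays tiny, and individual Patient files can be regenerated or replaced without rewriting the central file.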