
How to merge different MATLAB .mat files holding metadata for use in Python?


I've got 1,000+ very long MATLAB vectors (varying lengths, on the order of 10^8 samples) representing data from different patients and sources. I want to organize them compactly in a single file for convenient later access in Python, and I want each sample to somehow carry additional information (patient ID, sampling frequency, etc.).

The desired structure is:

Hospital 1:
   Pat. 1:
      vector:sample 1
      vector:sample 2

   Pat. 2:
      vector:sample 1
      vector:sample 2


Hospital 2:
   Pat. 1:
      vector:sample 1
      vector:sample 2
    .
    .
    .

I thought about converting the samples to HDF5, adding the metadata, and then merging the several HDF5 files into a single file, but I'm running into difficulties.
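A note that may simplify the conversion step: .mat files saved with MATLAB's `-v7.3` flag are already HDF5 files, so `h5py` can open them directly, with no intermediate conversion (older .mat versions are usually read with `scipy.io.loadmat`). A minimal self-contained sketch — the filename `patient_001.mat` and variable name `ecg` are invented for illustration, and the file is created here so the snippet runs on its own:

```python
import h5py
import numpy as np

# Stand-in for a MATLAB '-v7.3' save: such .mat files are plain HDF5,
# with each workspace variable stored as a dataset (MATLAB writes arrays
# transposed, so a row vector appears as shape (1, N)).
with h5py.File('patient_001.mat', 'w') as mat:
    mat['ecg'] = np.random.rand(1, 1000)

# Read the variable back with h5py, exactly as for any HDF5 file
with h5py.File('patient_001.mat', 'r') as mat:
    vec = np.array(mat['ecg']).squeeze()  # drop the singleton dimension

print(vec.shape)  # (1000,)
```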

already tried:

Open for suggestions!


Solution

  • I see at least two approaches with HDF5. You can copy all of your data into a single file; gigabytes of data are not a problem for HDF5 (given sufficient resources). Alternatively, you can save each Patient's data in a separate file and use External Links to point to it from a central HDF5 file. After you create the links, you can access the data as if it lived in the central file. Both methods are shown below with small, simple "samples" generated with NumPy random. Each sample is a single dataset and carries attributes with the Hospital, Patient, and Sample IDs.

    Method 1: All data in a single file

    import h5py
    import numpy as np
    
    num_h = 3   # hospitals
    num_p = 5   # patients per hospital
    num_s = 2   # samples per patient
    
    with h5py.File('SO_59556149.h5', 'w') as h5f:
    
        for h_cnt in range(num_h):
            for p_cnt in range(num_p):
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000,1)
                    dset = h5f.create_dataset(ds_name, data=vec_arr)
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID']=h_cnt
                    dset.attrs['Patient ID']=p_cnt
                    dset.attrs['Sample ID']=s_cnt
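
    Once written, the file can be walked and the metadata recovered from the attributes. A small sketch of reading such a file back — it builds a one-dataset demo file first so it runs standalone, and the filename is invented:

```python
import h5py
import numpy as np

# Build a tiny file in the same layout as Method 1
with h5py.File('SO_59556149_demo.h5', 'w') as h5f:
    dset = h5f.create_dataset('H_0_P_0_S_0', data=np.random.rand(1000, 1))
    dset.attrs['Hospital ID'] = 0
    dset.attrs['Patient ID'] = 0
    dset.attrs['Sample ID'] = 0

# Walk every dataset and collect its attributes into a plain dict
with h5py.File('SO_59556149_demo.h5', 'r') as h5f:
    meta = {name: dict(ds.attrs) for name, ds in h5f.items()}

print(meta)
```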
    

    Method 2: External links to Patient data in separate files

    import h5py
    import numpy as np
    
    num_h = 3   # hospitals
    num_p = 5   # patients per hospital
    num_s = 2   # samples per patient
    
    with h5py.File('SO_59556149_link.h5', 'w') as h5f:
    
        for h_cnt in range(num_h):
            for p_cnt in range(num_p):
                fname = 'SO_59556149_' + 'H_' + str(h_cnt) + '_P_' + str(p_cnt) + '.h5'
                h5f2 = h5py.File(fname, 'w')
                for s_cnt in range(num_s):
                    ds_name = 'H_' + str(h_cnt) + \
                              '_P_' + str(p_cnt) + \
                              '_S_' + str(s_cnt)
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000,1)
                    dset = h5f2.create_dataset(ds_name, data=vec_arr)
                    # add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID']=h_cnt
                    dset.attrs['Patient ID']=p_cnt
                    dset.attrs['Sample ID']=s_cnt
                    h5f[ds_name] = h5py.ExternalLink(fname, ds_name)
                h5f2.close()
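
    To confirm the links behave as described, a short sketch (filenames invented) that creates one Patient file, links to it from a central file, and reads the data back through the link:

```python
import h5py
import numpy as np

# A small Patient file holding one sample dataset
with h5py.File('patient_file.h5', 'w') as pf:
    pf.create_dataset('H_0_P_0_S_0', data=np.random.rand(1000, 1))

# A central file that only holds an External Link to that dataset
with h5py.File('central_file.h5', 'w') as cf:
    cf['H_0_P_0_S_0'] = h5py.ExternalLink('patient_file.h5', 'H_0_P_0_S_0')

# Reading through the link looks identical to reading a local dataset.
# Note: the linked file must remain reachable at the stored path
# (relative here), or the link will fail to resolve.
with h5py.File('central_file.h5', 'r') as cf:
    shape = cf['H_0_P_0_S_0'].shape

print(shape)  # (1000, 1)
```

    One practical consequence: the central file stays tiny, and individual Patient files can be regenerated or replaced without rewriting the central file.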