I've got 1,000+ very long MATLAB vectors (varying lengths, ~10^8 samples each) representing data from different patients and sources.
I wish to organize them compactly in one file for convenient access later in Python.
I want each sample to somehow hold additional information (patient ID, sampling frequency, etc.).
Order should be:
Hospital 1:
    Pat. 1:
        vector: sample 1
        vector: sample 2
    Pat. 2:
        vector: sample 1
        vector: sample 2
Hospital 2:
    Pat. 1:
        vector: sample 1
        vector: sample 2
    .
    .
    .
I thought about converting the samples to HDF5 files with metadata, and then merging several HDF5 files into a single file, but I'm facing difficulties.
already tried:
Open for suggestions!
I see at least 2 approaches with HDF5. You can copy all of your data into a single file. Gigabytes of data are not a problem for HDF5 (given sufficient resources). Alternatively, you could save Patient data in separate files, and use External Links to point to the data from a central HDF5 file. After you create the links, you can access the data "as-if" it's in that file. Both methods are shown below with small, simple "samples" created using NumPy random data. Each sample is a single dataset, and includes attributes with the Hospital, Patient and Sample ID.
Method 1: All data in a single file
import h5py
import numpy as np

num_h = 3  # number of hospitals
num_p = 5  # patients per hospital
num_s = 2  # samples per patient

with h5py.File('SO_59556149.h5', 'w') as h5f:
    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            for s_cnt in range(num_s):
                # Dataset name encodes hospital, patient and sample
                ds_name = f'H_{h_cnt}_P_{p_cnt}_S_{s_cnt}'
                # Create sample vector data and add to a dataset
                vec_arr = np.random.rand(1000, 1)
                dset = h5f.create_dataset(ds_name, data=vec_arr)
                # Add attributes of Hospital, Patient and Sample ID
                dset.attrs['Hospital ID'] = h_cnt
                dset.attrs['Patient ID'] = p_cnt
                dset.attrs['Sample ID'] = s_cnt
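Method 1 encodes the hierarchy in flat dataset names. If you would rather mirror the Hospital / Patient / sample layout from the question directly, HDF5 groups can do that. The following is only a minimal sketch under my own assumptions: the group and dataset names (`Hospital_0`, `Patient_0`, `Sample_0`), the `Sampling freq (Hz)` attribute and its value of 256 are all made up for illustration, and the file is written to a temporary directory. It also shows that slicing a dataset reads only the requested part from disk, which matters for vectors of ~10^8 samples.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical sizes and metadata, chosen only for illustration
num_h, num_p, num_s = 2, 2, 2
sampling_freq_hz = 256.0  # assumed value, not from the original post

path = os.path.join(tempfile.mkdtemp(), "grouped_example.h5")

with h5py.File(path, "w") as h5f:
    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            # One group per patient, nested under its hospital group
            grp = h5f.require_group(f"Hospital_{h_cnt}/Patient_{p_cnt}")
            grp.attrs["Patient ID"] = p_cnt
            for s_cnt in range(num_s):
                vec_arr = np.random.rand(1000)
                dset = grp.create_dataset(f"Sample_{s_cnt}", data=vec_arr)
                # Per-sample metadata lives on the dataset itself
                dset.attrs["Sampling freq (Hz)"] = sampling_freq_hz

with h5py.File(path, "r") as h5f:
    hospitals = sorted(h5f.keys())
    dset = h5f["Hospital_1/Patient_0/Sample_1"]
    first_ten = dset[:10]  # slices only this part of the dataset
    freq = dset.attrs["Sampling freq (Hz)"]
    n_samples = dset.shape[0]
```

With this layout, `h5f["Hospital_1"].keys()` lists the patients of one hospital, so you can walk the tree instead of parsing dataset names.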
Method 2: External links to Patient data in separate files
import h5py
import numpy as np

num_h = 3  # number of hospitals
num_p = 5  # patients per hospital
num_s = 2  # samples per patient

with h5py.File('SO_59556149_link.h5', 'w') as h5f:
    for h_cnt in range(num_h):
        for p_cnt in range(num_p):
            # One separate file per patient
            fname = f'SO_59556149_H_{h_cnt}_P_{p_cnt}.h5'
            with h5py.File(fname, 'w') as h5f2:
                for s_cnt in range(num_s):
                    ds_name = f'H_{h_cnt}_P_{p_cnt}_S_{s_cnt}'
                    # Create sample vector data and add to a dataset
                    vec_arr = np.random.rand(1000, 1)
                    dset = h5f2.create_dataset(ds_name, data=vec_arr)
                    # Add attributes of Hospital, Patient and Sample ID
                    dset.attrs['Hospital ID'] = h_cnt
                    dset.attrs['Patient ID'] = p_cnt
                    dset.attrs['Sample ID'] = s_cnt
                    # Link from the central file to the patient-file dataset
                    h5f[ds_name] = h5py.ExternalLink(fname, ds_name)
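To illustrate the "as-if" access through the central file, here is a small sketch that builds one patient file and one link, then reads the data back through the link. The file names (`patient_H_0_P_0.h5`, `central_links.h5`) and the use of a temporary directory are my own choices for the example. It assumes the linked file sits in the same directory as the central file, so the relative file name stored in the link can be resolved.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
patient_fname = "patient_H_0_P_0.h5"  # hypothetical file name
central_path = os.path.join(workdir, "central_links.h5")

# One per-patient file holding a single sample dataset
with h5py.File(os.path.join(workdir, patient_fname), "w") as h5f2:
    dset = h5f2.create_dataset("H_0_P_0_S_0", data=np.arange(1000.0))
    dset.attrs["Patient ID"] = 0

# Central file holds only the link; the file name is stored as given
with h5py.File(central_path, "w") as h5f:
    h5f["H_0_P_0_S_0"] = h5py.ExternalLink(patient_fname, "H_0_P_0_S_0")

with h5py.File(central_path, "r") as h5f:
    # getlink=True returns the link object instead of following it
    link = h5f.get("H_0_P_0_S_0", getlink=True)
    is_external = isinstance(link, h5py.ExternalLink)
    # Normal indexing follows the link into the patient file
    linked = h5f["H_0_P_0_S_0"]
    tail = linked[-3:]
    patient_id = linked.attrs["Patient ID"]
```

Keep in mind that the central file contains no data of its own: if a patient file is moved or deleted, the corresponding link can no longer be followed.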