
HDF5 tagging datasets to events in other datasets


I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.

Imagine I am measuring temperature over time, and for every 10 degree increase in temperature I sample a micro at 200 kHz. I want to be able to tag each large burst of micro data to a timestamp in the time-series data, maybe even in the form of a figure.

I was trying to do this with regionref, but am struggling to find an elegant solution. I'm finding myself juggling between pandas HDFStore and h5py, and it just feels messy.

Initially I thought I would be able to make separate datasets from the burst-data then use reference or links to timestamps in the time-series data. But no luck so far.

Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!

[table for reference]


Solution

  • How did you use region references? I assume you had an array of references, alternating between ranges of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
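    For comparison, here is a minimal sketch of the region-reference approach; the file and dataset names are made up for illustration, not taken from the question:

    ```python
    import numpy as np
    import h5py

    # Hypothetical data: a standard-rate log and one burst dataset
    with h5py.File('regionref_demo.h5', 'w') as h5f:
        log = h5f.create_dataset('data_log', data=np.arange(20.).reshape(10, 2))
        burst = h5f.create_dataset('burst_log_01', data=np.arange(8.).reshape(4, 2))

        # Store alternating references: a standard-rate slice, then a burst
        refs = h5f.create_dataset('refs', shape=(2,), dtype=h5py.regionref_dtype)
        refs[0] = log.regionref[0:5, :]    # first 5 rows of standard-rate data
        refs[1] = burst.regionref[:, :]    # the entire burst dataset

    # Recovering the data takes a double dereference: file[ref] returns the
    # dataset, and indexing that dataset with the same ref returns the region.
    with h5py.File('regionref_demo.h5', 'r') as h5f:
        for ref in h5f['refs']:
            data = h5f[ref][ref]
            print(data.shape)    # (5, 2), then (4, 2)
    ```

    This works, but you can see the messiness: every read requires walking the reference array and dereferencing twice, instead of one simple slice.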

    Virtual Datasets might be a more elegant solution... but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation; HDF5/h5py handles everything under the covers.

    To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:

    1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
    2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (1 virtual source per file/dataset combination).
    3. Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
    4. Repeat steps 2 and 3 for all sources (or slices of sources).

    Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)

    There are at least 3 other SO questions and answers on this topic.

    Example follows:
    Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.

    import numpy as np
    import h5py

    # Standard-rate log: 31 samples at 1 ms intervals
    log_ntimes = 31
    log_inc = 1e-3

    arr = np.zeros((log_ntimes, 2))
    for i in range(log_ntimes):
        arr[i, 0] = i*log_inc            # timestamp
    arr[:, 1] = 70. + 100.*arr[:, 0]     # simulated temperature

    with h5py.File('SO_72654160.h5', 'w') as h5f:
        h5f.create_dataset('data_log', data=arr)

    # Burst-rate logs: 3 bursts of 10 samples at 50 microsecond intervals
    n_bursts = 4
    burst_ntimes = 11
    burst_inc = 5e-5

    for n in range(1, n_bursts):
        arr = np.zeros((burst_ntimes-1, 2))
        burst_time = 0.01*n              # burst trigger timestamp
        for i in range(1, burst_ntimes):
            arr[i-1, 0] = burst_time + i*burst_inc
        arr[:, 1] = 70. + 100.*arr[:, 0]

        with h5py.File('SO_72654160.h5', 'a') as h5f:
            h5f.create_dataset(f'burst_log_{n:02}', data=arr)


    Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates one virtual dataset in a new file, and one in the existing file. (The statements are identical except for the file name and mode.)

    source_file = 'SO_72654160.h5'
    
    a0 = 0
    with h5py.File(source_file, 'r') as h5f:
        for ds_name in h5f:
            a0 += h5f[ds_name].shape[0]
    
    print(f'Total data rows in source = {a0}')
    
    # alternate getting data from:
    #   dataset: data_log, rows 0-11, 11-21, 21-31
    #   datasets: burst_log_01, burst_log_02, etc. (each has 10 rows)
    
    # Define virtual dataset layout
    layout = h5py.VirtualLayout(shape=(a0, 2),dtype=float)
    
    # Map virtual dataset to logged data ('data_log' has 31 rows)
    vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31,2))
    layout[0:11,:] = vsource1[0:11,:]
    vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
    layout[11:21,:] = vsource2
    
    layout[21:31,:] = vsource1[11:21,:]
    vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
    layout[31:41,:] = vsource2
    
    layout[41:51,:] = vsource1[21:31,:]
    vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
    layout[51:61,:] = vsource2
       
    # Create NEW file, then add virtual dataset
    with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
        h5vds.create_virtual_dataset("vdata", layout)
        print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')
    
    # Open EXISTING file, then add virtual dataset 
    with h5py.File('SO_72654160.h5', 'a') as h5vds:
        h5vds.create_virtual_dataset("vdata", layout)
        print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
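
    Once created, a virtual dataset reads with plain slice notation, and the boundaries between sources are invisible to the reader. Here is a self-contained mini-example showing the read-back (the file and dataset names are made up, not from the code above):

    ```python
    import numpy as np
    import h5py

    # Two small source datasets standing in for "standard" and "burst" data
    with h5py.File('vds_readback_src.h5', 'w') as h5f:
        h5f.create_dataset('a', data=np.arange(6.).reshape(3, 2))
        h5f.create_dataset('b', data=10. + np.arange(6.).reshape(3, 2))

    # Map both sources into one 6-row virtual layout
    layout = h5py.VirtualLayout(shape=(6, 2), dtype=float)
    layout[0:3, :] = h5py.VirtualSource('vds_readback_src.h5', 'a', shape=(3, 2))
    layout[3:6, :] = h5py.VirtualSource('vds_readback_src.h5', 'b', shape=(3, 2))

    with h5py.File('vds_readback.h5', 'w') as h5f:
        h5f.create_virtual_dataset('vdata', layout)

    # A single slice spanning the source boundary "just works":
    with h5py.File('vds_readback.h5', 'r') as h5f:
        print(h5f['vdata'][2:4, 0])   # last row of 'a', first row of 'b'
    ```

    That final slice is the payoff: no reference juggling, no double dereference, just ordinary indexing across datasets that live anywhere in the file (or in other files).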