
HDF5 tagging datasets to events in other datasets


I am sampling time series data off various machines, and every so often need to collect a large high frequency burst of data from another device and append it to the time series data.

Imagine I am measuring temperature over time, and for every 10 degree increase in temperature I sample a micro at 200 kHz. I want to be able to tag each large burst of micro data to a timestamp in the time-series data, maybe even in the form of a figure.

I was trying to do this with regionref, but am struggling to find an elegant solution. I'm finding myself juggling between pandas HDFStore and h5py, and it just feels messy.

Initially I thought I would be able to make separate datasets from the burst-data then use reference or links to timestamps in the time-series data. But no luck so far.

Any way to reference a large packet of data to a timestamp in another pile of data would be appreciated!

[table for reference]


Solution

  • How did you use region references? I assume you had an array of references, alternating between ranges of "standard rate" and "burst rate" data. That is a valid approach, and it will work. However, you are correct: it's messy to create, and messy to recover the data.
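    For comparison, here is a minimal sketch of the region-reference approach; the file and dataset names are made up for illustration, not taken from the question:

    ```python
    import numpy as np
    import h5py

    # Hypothetical data: a standard-rate log and one burst dataset
    with h5py.File('regionref_demo.h5', 'w') as h5f:
        log = h5f.create_dataset('data_log', data=np.arange(20.).reshape(10, 2))
        burst = h5f.create_dataset('burst_log_01', data=np.arange(8.).reshape(4, 2))

        # Store alternating references: a standard-rate slice, then a burst
        refs = h5f.create_dataset('refs', shape=(2,), dtype=h5py.regionref_dtype)
        refs[0] = log.regionref[0:5, :]    # first 5 rows of standard-rate data
        refs[1] = burst.regionref[:, :]    # the entire burst dataset

    # Recovering the data takes a double dereference: file[ref] returns the
    # dataset, and indexing that dataset with the same ref returns the region.
    with h5py.File('regionref_demo.h5', 'r') as h5f:
        for ref in h5f['refs']:
            data = h5f[ref][ref]
            print(data.shape)    # (5, 2), then (4, 2)
    ```

    This works, but you can see the messiness: every read requires walking the reference array and dereferencing twice, instead of one simple slice.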

    Virtual Datasets might be a more elegant solution... but tracking and creating the virtual layout definitions could get messy too. :-) However, once you have the virtual dataset, you can read it with typical slice notation; HDF5/h5py handles everything under the covers.

    To demonstrate, I created a "simple" example (realizing virtual datasets aren't "simple"). That said, if you can figure out region references, you can figure out virtual datasets. Here is a link to the h5py Virtual Dataset Documentation and Example for details. Here is a short summary of the process:

    1. Define the virtual layout: this is the shape and dtype of the virtual dataset that will point to other datasets.
    2. Define the virtual sources. Each is a reference to an HDF5 file and dataset (1 virtual source per file/dataset combination).
    3. Map virtual source data to the virtual layout (you can use slice notation, which is shown in my example).
    4. Repeat steps 2 and 3 for all sources (or slices of sources).

    Note: virtual datasets can be in a separate file, or in the same file as the referenced datasets. I will show both in the example. (Once you have defined the layout and sources, both methods are equally easy.)

    There are at least 3 other SO questions and answers on this topic.

    Example follows:
    Step 1: Create some example data. Without your schema, I guessed at how you stored "standard rate" and "burst rate" data. All standard rate data is stored in dataset 'data_log' and each burst is stored in a separate dataset named: 'burst_log_##'.

    import numpy as np
    import h5py

    # Standard-rate log: 31 samples at 1 ms intervals
    log_ntimes = 31
    log_inc = 1e-3

    arr = np.zeros((log_ntimes, 2))
    for i in range(log_ntimes):
        arr[i, 0] = i*log_inc            # timestamp
    arr[:, 1] = 70. + 100.*arr[:, 0]     # simulated temperature

    with h5py.File('SO_72654160.h5', 'w') as h5f:
        h5f.create_dataset('data_log', data=arr)

    # Burst-rate logs: 3 bursts of 10 samples at 50 microsecond intervals
    n_bursts = 4
    burst_ntimes = 11
    burst_inc = 5e-5

    for n in range(1, n_bursts):
        arr = np.zeros((burst_ntimes-1, 2))
        burst_time = 0.01*n              # burst trigger timestamp
        for i in range(1, burst_ntimes):
            arr[i-1, 0] = burst_time + i*burst_inc
        arr[:, 1] = 70. + 100.*arr[:, 0]

        with h5py.File('SO_72654160.h5', 'a') as h5f:
            h5f.create_dataset(f'burst_log_{n:02}', data=arr)


    Step 2: This is where the virtual layout and sources are defined and used to create the virtual dataset. This creates one virtual dataset in a new file, and one in the existing file. (The statements are identical except for the file name and mode.)

    source_file = 'SO_72654160.h5'
    
    a0 = 0
    with h5py.File(source_file, 'r') as h5f:
        for ds_name in h5f:
            a0 += h5f[ds_name].shape[0]
    
    print(f'Total data rows in source = {a0}')
    
    # alternate getting data from:
    #   dataset: data_log, rows 0-11, 11-21, 21-31
    #   datasets: burst_log_01, burst_log_02, etc. (each has 10 rows)
    
    # Define virtual dataset layout
    layout = h5py.VirtualLayout(shape=(a0, 2),dtype=float)
    
    # Map virtual dataset to logged data ('data_log' has 31 rows)
    vsource1 = h5py.VirtualSource(source_file, 'data_log', shape=(31,2))
    layout[0:11,:] = vsource1[0:11,:]
    vsource2 = h5py.VirtualSource(source_file, 'burst_log_01', shape=(10,2))
    layout[11:21,:] = vsource2
    
    layout[21:31,:] = vsource1[11:21,:]
    vsource2 = h5py.VirtualSource(source_file, 'burst_log_02', shape=(10,2))
    layout[31:41,:] = vsource2
    
    layout[41:51,:] = vsource1[21:31,:]
    vsource2 = h5py.VirtualSource(source_file, 'burst_log_03', shape=(10,2))
    layout[51:61,:] = vsource2
       
    # Create NEW file, then add virtual dataset
    with h5py.File('SO_72654160_VDS.h5', 'w') as h5vds:
        h5vds.create_virtual_dataset("vdata", layout)
        print(f'Total data rows in VDS 1 = {h5vds["vdata"].shape[0]}')
    
    # Open EXISTING file, then add virtual dataset 
    with h5py.File('SO_72654160.h5', 'a') as h5vds:
        h5vds.create_virtual_dataset("vdata", layout)
        print(f'Total data rows in VDS 2 = {h5vds["vdata"].shape[0]}')
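
    Once created, a virtual dataset reads with plain slice notation, and the boundaries between sources are invisible to the reader. Here is a self-contained mini-example showing the read-back (the file and dataset names are made up, not from the code above):

    ```python
    import numpy as np
    import h5py

    # Two small source datasets standing in for "standard" and "burst" data
    with h5py.File('vds_readback_src.h5', 'w') as h5f:
        h5f.create_dataset('a', data=np.arange(6.).reshape(3, 2))
        h5f.create_dataset('b', data=10. + np.arange(6.).reshape(3, 2))

    # Map both sources into one 6-row virtual layout
    layout = h5py.VirtualLayout(shape=(6, 2), dtype=float)
    layout[0:3, :] = h5py.VirtualSource('vds_readback_src.h5', 'a', shape=(3, 2))
    layout[3:6, :] = h5py.VirtualSource('vds_readback_src.h5', 'b', shape=(3, 2))

    with h5py.File('vds_readback.h5', 'w') as h5f:
        h5f.create_virtual_dataset('vdata', layout)

    # A single slice spanning the source boundary "just works":
    with h5py.File('vds_readback.h5', 'r') as h5f:
        print(h5f['vdata'][2:4, 0])   # last row of 'a', first row of 'b'
    ```

    That final slice is the payoff: no reference juggling, no double dereference, just ordinary indexing across datasets that live anywhere in the file (or in other files).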