python, hdf5, h5py

h5py error reading virtual dataset into NumPy array


I'm trying to load data from a virtual HDF5 dataset created with h5py, and I'm having trouble getting the data to load correctly.

Here is an example of my issue:

import h5py
import tools as ut

virtual = h5py.File(ut.params.paths.virtual, "r")

a = virtual['part2/index'][:]

print(virtual['part2/index'][-1])
print(a[-1])

This outputs:

[890176134]
[0]

Why? Why is the last element different when I copy the data into a NumPy array (value=[0]) vs when I read directly from the dataset (value=[890176134])?

Am I doing something horribly wrong without realizing it?

Thanks a lot.


Solution

  • You should get the same values whether you read from the virtual dataset directly or from a NumPy array created from it. It's hard to diagnose the error without more details about the data.

    I used the h5py example vds_simple.py to demonstrate how this should behave. Most of the code builds the HDF5 files. The section at the end compares the output. The code below is modified from the example to create a variable number of source files (set by a0).

    Code to create the 'a0' source files with sample data:

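    import h5py
    import numpy as np

    a0 = 5000  # number of source files to create
    # Create one row of sample data, shape (1, 100)
    data = np.arange(0, 100).reshape(1, 100)
    
    # Create source files (0.h5 to 4999.h5 when a0=5000); file n holds data + n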
    for n in range(a0):
        with h5py.File(f"{n}.h5", "w") as f:
            row_data = data + n
            f.create_dataset("data", data=row_data)
    

    Code to define the virtual layout and assemble virtual dataset:

    # Assemble virtual dataset
    layout = h5py.VirtualLayout(shape=(a0, 100), dtype="i4")
    for n in range(a0):
        filename = f"{n}.h5"
        vsource = h5py.VirtualSource(filename, "data", shape=(100,))
        layout[n] = vsource
    
    # Add virtual dataset to output file
    with h5py.File("VDS.h5", "w", libver="latest") as f:
        f.create_virtual_dataset("vdata", layout)
    

    Code to read and print the data:

    # read data back
    # virtual dataset is transparent for reader!
    with h5py.File("VDS.h5", "r") as f:
        arr = f["vdata"][:]
    
        print("\nFirst 10 Elements in First Row:")
        print("Virtual dataset:")
        print(f["vdata"][0, :10])
        print("Reading vdata into Array:")
        print(arr[0, :10])
    
        print("Last 10 Elements of Last Row:")
        print("Virtual dataset:")
        print(f["vdata"][-1,-10:])
        print("Reading vdata into Array:")
        print(arr[-1,-10:])    
    

    Output from code above (w/ a0=5000):

    First 10 Elements in First Row:
    Virtual dataset:
    [0 1 2 3 4 5 6 7 8 9]
    Reading vdata into Array:
    [0 1 2 3 4 5 6 7 8 9]
    Last 10 Elements of Last Row:
    Virtual dataset:
    [5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
    Reading vdata into Array:
    [5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
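
If the two reads disagree on your own file, one way to gather the missing details is to compare the reads programmatically and to check that every source file in the virtual mapping can be found. The sketch below is only an illustration under assumptions: `build_small_vds` and `check_vds` are hypothetical helper names (not part of h5py), the fill value of -1 is chosen just to make gaps visible, and relative source paths are assumed to resolve against the folder holding the VDS file. It does not claim to reproduce the problem in the question; it only shows one cause worth ruling out, since a source file that cannot be found is read back as the fill value instead of raising an error.

```python
import os
import tempfile

import h5py
import numpy as np


def build_small_vds(folder, n_files=4, n_cols=10):
    """Hypothetical helper: write n_files one-row sources and a VDS stacking them."""
    layout = h5py.VirtualLayout(shape=(n_files, n_cols), dtype="i4")
    for n in range(n_files):
        src_path = os.path.join(folder, f"{n}.h5")
        with h5py.File(src_path, "w") as f:
            f.create_dataset("data", data=np.arange(n_cols, dtype="i4") + n)
        layout[n] = h5py.VirtualSource(src_path, "data", shape=(n_cols,))
    vds_path = os.path.join(folder, "VDS.h5")
    with h5py.File(vds_path, "w", libver="latest") as f:
        # fillvalue=-1 is an arbitrary choice so unmapped or missing data stands out
        f.create_virtual_dataset("vdata", layout, fillvalue=-1)
    return vds_path


def check_vds(vds_path, name):
    """Hypothetical helper: return (reads_match, missing_source_files)."""
    with h5py.File(vds_path, "r") as f:
        dset = f[name]
        if not dset.is_virtual:
            raise ValueError(f"{name} is not a virtual dataset")
        # Compare one bulk read against row-by-row reads of the same dataset
        bulk = dset[:]
        rows = np.stack([dset[i] for i in range(dset.shape[0])])
        reads_match = np.array_equal(bulk, rows)
        # Assumption: relative source names resolve against the VDS file's folder
        base = os.path.dirname(os.path.abspath(vds_path))
        missing = []
        for mapping in dset.virtual_sources():
            if mapping.file_name == ".":  # source lives in the VDS file itself
                continue
            path = os.path.join(base, mapping.file_name)
            if not os.path.exists(path):
                missing.append(path)
    return reads_match, missing


with tempfile.TemporaryDirectory() as tmp:
    vds_path = build_small_vds(tmp)
    print(check_vds(vds_path, "vdata"))

    # Remove one source file to show how a gap appears in the virtual dataset
    os.remove(os.path.join(tmp, "2.h5"))
    reads_match, missing = check_vds(vds_path, "vdata")
    print(reads_match, [os.path.basename(p) for p in missing])
    with h5py.File(vds_path, "r") as f:
        print(f["vdata"][2])  # this row now comes back as the fill value
```

On a real file, calling `check_vds` with the path and dataset name from the question (for example `'part2/index'`) would show whether the bulk read and the row-by-row reads differ, and which source files, if any, cannot be located.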