Tags: python, arrays, hdf5, h5py, pytables

How do I combine HDF5 arrays into a table?


I have an HDF5 file containing 15 arrays and nothing else. Normally I would query this information using SQL, but the database is down and all I have in its place is the HDF5 file and PyTables. The only querying I can find in PyTables that returns a "row" rather than specific elements from a column works on Tables, not Arrays.

Currently I've been creating the table from scratch in its own h5 file, populating each row individually, and flushing every once in a while. This takes a long time, since there are 29 million rows. Here is the code I've been using to create the table:

import math
import numpy as np
import h5py
import tables

#Defining Table Structure
table_description = {
        'Column1':tables.FloatCol(),
        'Column2':tables.FloatCol(),
        'Column3':tables.FloatCol(),
         ....
        'Column15':tables.FloatCol()}

#Opening the HDF5 file
hdf5_file = h5py.File('File Path','r')

#Pulling out the arrays (the future columns)
Column1_array = np.array(hdf5_file.get('Column1'))
Column2_array = np.array(hdf5_file.get('Column2'))
Column3_array = np.array(hdf5_file.get('Column3'))
...
Column15_array = np.array(hdf5_file.get('Column15'))

#Creating a New H5 file
new_file = tables.open_file('new_table.h5','w')

#Creating a New Table in the File
tbl = new_file.create_table('/','Big_Table',table_description)

i = 0
row = tbl.row #A row pointer
while i < 29069765: #Since I know the length of the columns, I'm able to just index.
    row['Column1'] = Column1_array[i] #Filling each column in a row.
    row['Column2'] = Column2_array[i] #I have pulled each column out of the HDF5 file,
    row['Column3'] = Column3_array[i] #using h5py. 
    ...
    row['Column15'] = Column15_array[i]
    row.append() #Adding the row to the table
    i += 1
    if math.fmod(i,100) == 0: #Every 100 rows, I flush the table and the file
        tbl.flush()
        new_file.flush()

new_file.close()

I haven't even begun the process of querying it, but I plan to use the Table.where() function on Big_Table.

Is there a faster way to combine all these column arrays into a single table and run multi argument queries on it?


Solution

  • I reworked your example to read each array from the original file and write the data to a single table in a new HDF5 file. The first snippet reads each dataset with h5py slicing; the PyTables-only variant further down uses get_node() plus the .read() method to get each array as a NumPy array. The data is written to the new table with .modify_column(), where column= is the data (e.g. Col_array) and colname= is the column/field name to write to (e.g. 'Column#'). I also added a loop over the columns, which simplifies the code and reduces the memory footprint because only one column of data is read and written at a time.

    import tables as tb
    import numpy as np
    import h5py
    
    ##Code to create the first HDF5 file used in my example
    #hdf5_file = h5py.File('SO_62782315_1.h5','w')
    #
    #for cnt in range(1,16,1):
    #    arr = np.random.rand(1000)
    #    hdf5_file.create_dataset('Column'+str(cnt),data=arr)
    #hdf5_file.close()
    
    #Defining Table Structure
    table_dt = np.dtype( [ 
                   ('Column1', 'f8'), ('Column2', 'f8'), ('Column3', 'f8'),
                   ('Column4', 'f8'), ('Column5', 'f8'), ('Column6', 'f8'),
                   ('Column7', 'f8'), ('Column8', 'f8'), ('Column9', 'f8'),
                   ('Column10', 'f8'), ('Column11', 'f8'), ('Column12', 'f8'),
                   ('Column13', 'f8'), ('Column14', 'f8'), ('Column15', 'f8') ] )   
    
    #Creating a New H5 file
    new_file = tb.open_file('SO_62782315_2.h5','w')
    
    #Creating a New Table in the File
    tbl = new_file.create_table('/','Big_Table',table_dt)
    # create a 1000-row array of zeros and append it to the table to allocate space
    table_arr = np.zeros((1000,), dtype=table_dt)
    tbl.append(table_arr)
    
    #Open the existing HDF5 file with h5py
    hdf5_file = h5py.File('SO_62782315_1.h5','r')
    
    for cnt in range(1,16,1):
        # read one column at a time with h5py slicing, then write it to the table
        Col_array = hdf5_file['Column'+str(cnt)][:]
        tbl.modify_column(column=Col_array, colname='Column'+str(cnt))
        new_file.flush()
    
    new_file.close()
    hdf5_file.close()
    

    You can do everything with PyTables (tables) using this small modification to open the file and read the array data.

    #Open the existing HDF5 file with PyTables
    hdf5_file = tb.open_file('SO_62782315_1.h5','r')
    
    for cnt in range(1,16,1):
        # read one column at a time with get_node()/.read(), then write it to the table
        Col_array = hdf5_file.get_node('/','Column'+str(cnt)).read()
        tbl.modify_column(column=Col_array, colname='Column'+str(cnt))
        new_file.flush()
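
    Once the table is built, you can run the multi-argument queries you mentioned with Table.where() or Table.read_where(). Here is a minimal sketch, assuming the file and column names from the example above; the threshold values are made up for illustration:

    #A minimal sketch of a multi-condition query, assuming the file/column
    #names from the example above; the threshold values are made up.
    with tb.open_file('SO_62782315_2.h5','r') as h5f:
        big_table = h5f.root.Big_Table
        #read_where() evaluates the condition and returns the matching
        #rows as a NumPy structured array
        matches = big_table.read_where('(Column1 > 0.5) & (Column2 < 0.1)')
        print(matches[:5])

    If the result set is large, iterating with for row in big_table.where(...) avoids loading all matching rows into memory at once.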