python, hdf5, large-data, h5py, pytables

H5py: Merge matched lines from a huge HDF5 file into a smaller dataset


I have two huge HDF5 files, each with an index of ids, and each containing different information about those ids.

I have read one of them into a small masked dataset (data), using only a select few ids. I now want to add to that dataset, using information about those select ids from one column ('a') of the second HDF5 file (s_data).

Currently I have to read through the entire second HDF5 file and select the ids that match, as per:

for i in range(len(data['ids'])):
    print(i)
    # boolean-mask the entire second file for each id -- one full scan per id
    data['a'][i] = s_data['a'][s_data['ids'] == data['ids'][i]]

Now, for 190 million ids, this takes an uncomfortably long time. Is there a simpler way to match them? I'm thinking of a pandas-style join, but I can't find a way to make this work with h5py datasets (the sketch below shows the kind of join I mean).
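
For reference, a minimal sketch of the pandas-style join I have in mind, assuming the 'ids' and 'a' columns of the second file fit in memory as plain arrays and that each id occurs once in s_data:

import pandas as pd

# Pull both columns of the second file fully into memory as a DataFrame:
s_df = pd.DataFrame({'ids': s_data['ids'][:], 'a': s_data['a'][:]})
# A left join preserves the row order of data['ids']:
df = pd.DataFrame({'ids': data['ids']})
data['a'] = df.merge(s_df, on='ids', how='left')['a'].to_numpy()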

Many thanks in advance!


Solution

  • Have you considered PyTables? It is another Python package for reading HDF5 files. It has fast search algorithms based on OPSI (Optimized Partially Sorted Indexes). Using the .read_where() method with a search condition will simplify the search process, and it should be faster than h5py (see the indexing sketch just below).
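
    To benefit from those OPSI indexes, the searched column has to be indexed first. A minimal one-time setup sketch, assuming 'sdata' is a PyTables Table with an 'ids' column (the file name is illustrative):

    import tables as tb
    # Index creation writes to the file, so open it in append mode:
    with tb.open_file('yourfile2.h5', 'a') as h5f2:
        sdata = h5f2.root.sdata
        if not sdata.cols.ids.is_indexed:
            # create_csindex() would build a completely sorted index instead
            sdata.cols.ids.create_index()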

    Your question is similar to another one I answered last week about finding duplicates. You can read my answer here:
    Pytables duplicates 2.5 giga rows

    Before searching, I would get an array of unique values from the 'ids' field in 'data' to use in the .read_where() condition to search 'sdata'. If I understand your process and data, the code would look like this:

    import tables as tb
    import numpy as np

    # Open both HDF5 files:
    h5f1 = tb.open_file('yourfile1.h5', 'r')
    h5f2 = tb.open_file('yourfile2.h5', 'r')
    # Reference the data and sdata tables:
    data  = h5f1.root.data
    sdata = h5f2.root.sdata

    # Step 1: Get a NumPy array of the 'ids' field/column from the data table:
    ids_arr = data.read(field='ids')
    # Step 2: Get a new array with unique values only:
    uids_arr = np.unique(ids_arr)

    # Or, combine Steps 1 and 2 into one line:
    uids_arr = np.unique(data.read(field='ids'))

    # Step 3a: Loop over the unique id values
    for id_test in uids_arr:
        # Step 3b: Get an array with all rows that match this id value;
        #          only returns values from field 'a'
        match_row_arr = sdata.read_where('ids==id_test', field='a')
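
    From there, a sketch of writing the matched values back into the 'a' column of 'data' (this assumes each id occurs exactly once in 'sdata', as your original loop does; Table.modify_column() rewrites one column in place, so 'yourfile1.h5' would need to be opened in append mode for this step):

    # Read the 'a' column into memory, fill in the matches, then write it back:
    a_arr = data.read(field='a')
    for id_test in uids_arr:
        match_row_arr = sdata.read_where('ids==id_test', field='a')
        if match_row_arr.size:
            a_arr[ids_arr == id_test] = match_row_arr[0]
    data.modify_column(column=a_arr, colname='a')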