I have two huge hdf5 files, each with an index of ids, and each containing different information about each of those ids.
I have read one into a small masked dataset (data), using only a select few ids. I now want to add to the dataset, using information about those select ids from one column ('a') of the second hdf5 file (s_data).
Currently I am having to read though the entire 2nd hdf5 file and select ids that match, as per:
for i in range(len(data['ids'])):
print(i)
data['a'][i] = s_data['a'][s_data['ids'] == data['ids'][i]]
Now for 190million ids, this takes an uncomfortably long time. Is here a simpler way to match them? I'm thinking a pandas style join, however I can't find a way for this to work with h5py datasets.
Many thanks in advance!
Have you considered PyTables? It is another Python package to read HDF5 files. It has fast search algorithms based on OPSI (Optimized Partially Sorted Indexes). Using the .read_where()
method with a search condition will simplify the search process and should be faster than h5py.
Your question is similar another I answered last week about finding duplicates. You can read my answer here:
Pytables duplicates 2.5 giga rows
Before searching, I would get an array of unique values from the 'ids' field in 'data' to use in the .read_where()
condition to search 'sdata'.
If I understand your process and data, the code would look like this:
import tables as tb
# need to open HDF5 files
h5f1 = tb.File('yourfile1.h5','r')
h5f2 = tb.File('yourfile2.h5','r')
# define data and sdata datasets:
data = h5f1.root.data
sdata = h5f2.root.sdata
# Step 1: Get a Numpy array of the 'ids' field/column from data DS:
ids_arr = data.read(field='ids')
# Step 2: Get new array with unique values only:
uids_arr = np.unique(ids_arr)
#Or, combine steps 1 and 2 into one line:
uids_arr = np.unique(data.read(field='ids'))
# Step 3a: Loop on rows of unique id values
for id_test in uids_arr :
# Step 3b: Get an array with all rows that match this id value,
# Only returns values in field 'a'
match_row_arr = sdata.read_where('ids==id_test',field='a')