
Performance: list all keys from an HDF5 file with pandas


Is it normal that it takes so long to obtain all the keys present in an HDF5 file?

Code Sample:

    import time
    import pandas as pd

    start = time.time()
    store = pd.HDFStore(filepath)
    print(time.time() - start)
    # 0.0

    start = time.time()
    a = store.keys()
    print(time.time() - start)
    # 23.874846696853638

    len(a)
    # 80

    start = time.time()
    store.select(key="/data/table1")  # the next table would be /data/table2
    print(time.time() - start)
    # 0.062399864196777344

All keys are 'tables' (i.e. not fixed). There are about 80 keys present in the file.

The entire .h5 file is 348 MB. Each table is roughly the same size after loading into a pandas.DataFrame: about 2.6 MB.

pandas v.0.20.1

tables v.3.2.2

I am wondering whether the key hierarchy is the issue: all keys live under /data/table[X] instead of directly under /table[X].


Solution

  • I have the same issue. The cause appears to be the way PyTables checks every single node value to build the list of keys. I've raised this with the pandas developers.

    If you want to check whether a key is in the store, then

    store.__contains__(key)
    

    will do the job and is much faster.

    https://github.com/pandas-dev/pandas/issues/17593
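    A minimal sketch of the membership check (the file name and keys below are made up for illustration; the small store stands in for the 80-key file from the question):

    ```python
    import numpy as np
    import pandas as pd

    filepath = "demo_store.h5"

    with pd.HDFStore(filepath, mode="w") as store:
        # Create a handful of table-format keys under /data.
        for i in range(1, 6):
            df = pd.DataFrame(np.random.randn(10, 2), columns=["a", "b"])
            store.put(f"/data/table{i}", df, format="table")

    with pd.HDFStore(filepath) as store:
        # `key in store` calls HDFStore.__contains__ under the hood and
        # checks a single node instead of enumerating every node as keys() does.
        print("/data/table3" in store)    # True
        print("/data/table99" in store)   # False
    ```

    Note that `key in store` is the idiomatic spelling of `store.__contains__(key)`.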