I have a large number of DataFrames exported to a series of HDFStore files through pandas. I need to be able to quickly pull in the most recent record for each of these DataFrames on demand.
The setup:
<class 'pandas.io.pytables.HDFStore'>
File path: /data/storage_X100.hdf
/X1 frame_table (typ->appendable,nrows->2652,ncols->1,indexers->[index])
/XX frame_table (typ->appendable,nrows->2652,ncols->3,indexers->[index])
/Y1 frame_table (typ->appendable,nrows->2652,ncols->2,indexers->[index])
/YY frame_table (typ->appendable,nrows->2652,ncols->3,indexers->[index])
I am storing roughly 100 data frames in each HDF file and have around 5000 files to run through. Each of the data frames in the HDFStore is indexed with a DatetimeIndex.
For a single file, I'm currently looping through HDFStore.keys() and then querying each data frame with tail(1), like so:
import pandas

store = pandas.HDFStore(filename)
lastrecs = {}
for key in store.keys():
    last = store[key].tail(1)
    lastrecs[key] = last
Is there a better way of doing this, perhaps with HDFStore.select_as_multiple? Even selecting the last record without pulling in the entire data frame for a tail would probably speed things up tremendously. How can this be done?
Use start and/or stop to specify a range of rows. You still need to iterate over the keys, but this will select just the last row of a table, so it should be very fast.
In [1]: df = pd.DataFrame(np.random.randn(10,5))
In [2]: df.to_hdf('test.h5','df',mode='w',format='table')
In [3]: store = pd.HDFStore('test.h5')
In [4]: store
Out[4]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable,nrows->10,ncols->5,indexers->[index])
In [5]: nrows = store.get_storer('df').nrows
In [6]: nrows
Out[6]: 10
In [7]: store.select('df',start=nrows-1,stop=nrows)
Out[7]:
0 1 2 3 4
9 0.221869 -0.47866 1.456073 0.093266 -0.456778
In [8]: store.close()
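Putting this together with the original loop over keys, a minimal sketch of the per-file version (the helper name last_records is mine, not from the original):

```python
import pandas as pd

def last_records(filename):
    """Return the last row of every table in one HDFStore file.

    Uses get_storer(key).nrows plus select(start/stop) so only the
    final row of each table is read from disk, never the full frame.
    """
    lastrecs = {}
    with pd.HDFStore(filename, mode="r") as store:
        for key in store.keys():
            nrows = store.get_storer(key).nrows
            # select just the final row of the table
            lastrecs[key] = store.select(key, start=nrows - 1, stop=nrows)
    return lastrecs
```

This only works for format='table' stores (as in the question), since fixed-format stores cannot be row-sliced with start/stop.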
Here's a question that uses nrows (for a different purpose).