Given a pandas HDFStore containing a DataFrame
:
import pandas as pd
import numpy.random as rd
df = pd.DataFrame(rd.randn(int(1000)).reshape(500, 2), columns=list('ab'))
store = pd.HDFStore('store.h5')
store.append('df', df, data_columns=['a', 'b'])
I can use the select
function to retrieve a subset of the data, like so:
store.select('df', ['a > 0', 'b > 0'])
However, I can get the same output by falling back to the kind of command I might use if I were using a plain DataFrame
that was not in an HDFStore
:
store.df[(store.df.a > 0) & (store.df.b > 0)]
Is there a difference between these two approaches? If so, what is the difference?
If you run some benchmark, you'll find the following
%timeit store.select('df', ['a > 0', 'b > 0'])
100 loops, best of 3: 2.63 ms per loop
%timeit store.df[(store.df.a > 0) & (store.df.b > 0)]
100 loops, best of 3: 6.01 ms per loop
This suggest that the first select access the file fewer times than the second method. Specifically each of the following
%timeit store.df.a > 0
100 loops, best of 3: 1.84 ms per loop
%timeit store.df.b > 0
1000 loops, best of 3: 1.82 ms per loop
take about 2ms, then you need to select according to a logical AND of both. Only, then you need to apply the final filter. The select instead, access the data only once!