I am trying to filter data from large HDF store to the required subset using the attribute where of method read_hdb:
phase = pd.read_hdf(DSPATH + '/phase-table.h5', 'phase', where='EXTSIDNAME=="A"')
According to the documentation I can specify any column defined in the dataset with basic logical conditions. According to Pandas documentation syntax column_name == 'string literal' is supported.
The library throws however a ValueError exception for any column I am trying to specify:
ValueError: The passed where expression: EXTSIDNAME=="A"
contains an invalid variable reference
all of the variable references must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: index,columns
The only condition, which does not through the error is 'index=1'.
The column exists in the data store. If I load it without filter I can see that I am trying to specify in the where condition do exist:
Index(['EXTSIDNAME', 'HOSTNAME', 'TIMESTP', 'SUM_ENDDATE','MODULE_ID','MODULENAME',
'MODULE_STARTDATE', 'MODULE_ENDDATE', 'PHASE_ID','PHASENAME',
'PHASE_STARTDATE', 'PHASE_ENDDATE', 'ID', 'PhaseDuration'], dtype='object')
I am using the latest stable libraries from Anaconda bundle.
If you created the HDF store with to_hdf()
, you need to specify the data_columns
parameter. A similar question is posted here.
An example:
d = {'Col': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
'X': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame.from_dict(d)
df
looks like this:
Col X
0 A 1
1 A 2
2 A 3
3 B 4
4 B 5
5 B 6
6 C 7
7 D 8
Let's write it to .h5 file with to_hdf()
. It's important that format equals 'table'
:
df.to_hdf('myhdf.h5', 'somekey', format='table')
Now let's read it with read_hdf()
:
pd.read_hdf('myhdf.h5', key='somekey', where='Col==A')
Returns error:
ValueError: The passed where expression: Col==A
contains an invalid variable reference
all of the variable refrences must be a reference to
an axis (e.g. 'index' or 'columns'), or a data_column
The currently defined references are: index,columns
What gives?
When you do to_hdf()
, you need to also define data_columns
like this:
df.to_hdf('myhdf.h5', 'somekey', format='table', data_columns=['Col', 'X'])
Now you can read data from .h5 file using where
:
pd.read_hdf('myhdf.h5', key='somekey', where='Col==A')
Col X
0 A 1
1 A 2
2 A 3
With where
as a list:
pd.read_hdf('myhdf.h5', key='somekey', where=['Col==A', 'X==2'])
Col X
1 A 2