Search code examples
pythonpandashdfstore

Pandas read_hdf() returns "invalid variable reference" on valid where condition


I am trying to filter data from large HDF store to the required subset using the attribute where of method read_hdb:

phase = pd.read_hdf(DSPATH + '/phase-table.h5', 'phase', where='EXTSIDNAME=="A"')

According to the documentation I can specify any column defined in the dataset with basic logical conditions. According to Pandas documentation syntax column_name == 'string literal' is supported.

The library throws however a ValueError exception for any column I am trying to specify:

ValueError: The passed where expression: EXTSIDNAME=="A"
            contains an invalid variable reference
            all of the variable references must be a reference to
            an axis (e.g. 'index' or 'columns'), or a data_column
            The currently defined references are: index,columns

The only condition, which does not through the error is 'index=1'.

The column exists in the data store. If I load it without filter I can see that I am trying to specify in the where condition do exist:

Index(['EXTSIDNAME', 'HOSTNAME', 'TIMESTP', 'SUM_ENDDATE','MODULE_ID','MODULENAME',
'MODULE_STARTDATE', 'MODULE_ENDDATE', 'PHASE_ID','PHASENAME', 
'PHASE_STARTDATE', 'PHASE_ENDDATE', 'ID', 'PhaseDuration'], dtype='object')

I am using the latest stable libraries from Anaconda bundle.


Solution

  • If you created the HDF store with to_hdf(), you need to specify the data_columns parameter. A similar question is posted here.

    An example:

    d = {'Col': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'D'],
         'X': [1, 2, 3, 4, 5, 6, 7, 8]}
    df = pd.DataFrame.from_dict(d)
    

    df looks like this:

      Col  X
    0   A  1
    1   A  2
    2   A  3
    3   B  4
    4   B  5
    5   B  6
    6   C  7
    7   D  8
    

    Let's write it to .h5 file with to_hdf(). It's important that format equals 'table':

    df.to_hdf('myhdf.h5', 'somekey', format='table')
    

    Now let's read it with read_hdf():

    pd.read_hdf('myhdf.h5', key='somekey', where='Col==A')
    

    Returns error:

    ValueError: The passed where expression: Col==A
                contains an invalid variable reference
                all of the variable refrences must be a reference to
                an axis (e.g. 'index' or 'columns'), or a data_column
                The currently defined references are: index,columns
    

    What gives?

    When you do to_hdf(), you need to also define data_columns like this:

    df.to_hdf('myhdf.h5', 'somekey', format='table', data_columns=['Col', 'X'])
    

    Now you can read data from .h5 file using where:

    pd.read_hdf('myhdf.h5', key='somekey', where='Col==A')
      Col  X
    0   A  1
    1   A  2
    2   A  3
    

    With where as a list:

    pd.read_hdf('myhdf.h5', key='somekey', where=['Col==A', 'X==2'])
      Col  X
    1   A  2