Search code examples
pandashdfstore

HDFStore Term memory efficient way to check for membership in list


I have a pandas HDFStore that I am try to select from. I would like to select data between a two timestamps with an id in a large np.array. The following code works but takes up too much memory only when queried for membership in a list. If I use a datetimeindex and a range, the memory footprint is 95% less.

#start_ts, end_ts are timestamps
#instruments is an array of python objects

not_memory_efficient = adj_data.select("US", [Term("date",">=", start_ts),
                              Term("date", "<=", end_ts),
                              Term("id", "=", instruments)])
memory_efficient = adj_data.select("US", [Term("date",">=", start_ts),
                              Term("date", "<=", end_ts),)

Is there a more memory efficient way to do this in HDFStore? Should I set the index to the "sec_id"? (I can also use the chunksize option and concat myself, but that seems to be a bit of a hack.)

Edits:

The hdfstore is created by pd.HDFStore creating a dataframe and storing such as this. I made a mistake earlier

def write_data(country_data, store_file):
    for country in country_data:
        if len(country_data[country]) == 0:
            continue
        df = pd.concat(country_data[country], ignore_index=True)
        country_data[country] = []
        store_file.append(country, df, format="t")

As requested, here is the ptdump for this table: https://gist.github.com/MichaelWS/7980846 also, here is the df: https://gist.github.com/MichaelWS/7981451


Solution

  • To memorialize this for other users.

    In HDFStore, is required to designate certain columns as data_columns if they are not the index in order to later query then.

    Docs are here

    Create a frame

    In [23]: df = DataFrame(dict(date = pd.date_range('20130101',periods=10), id = list('abcabcabcd'), C = np.random.randn(10)))
    
    
    In [28]: df
    Out[28]: 
              C                date id
    0  0.605701 2013-01-01 00:00:00  a
    1  0.451346 2013-01-02 00:00:00  b
    2  0.479483 2013-01-03 00:00:00  c
    3 -0.012589 2013-01-04 00:00:00  a
    4 -0.028552 2013-01-05 00:00:00  b
    5  0.737100 2013-01-06 00:00:00  c
    6 -1.050292 2013-01-07 00:00:00  a
    7  0.137444 2013-01-08 00:00:00  b
    8 -0.327491 2013-01-09 00:00:00  c
    9 -0.660220 2013-01-10 00:00:00  d
    
    [10 rows x 3 columns]
    

    Save to hdf WITHOUT data_columns

    In [24]: df.to_hdf('test.h5','df',mode='w',format='table')
    

    0.13 will report this error (0.12 will just silently ignore)

    In [25]: pd.read_hdf('test.h5','df',where='date>20130101 & date<20130105 & id=["b","c"]')
     ValueError: The passed where expression: date>20130101 & date<20130105 & id=["b","c"]
                contains an invalid variable reference
                all of the variable refrences must be a reference to
                an axis (e.g. 'index' or 'columns'), or a data_column
                The currently defined references are: index,columns
    

    Set all the columns as data columns (can also be a specific list of columns)

    In [26]: df.to_hdf('test.h5','df',mode='w',format='table',data_columns=True)
    
    In [27]: pd.read_hdf('test.h5','df',where='date>20130101 & date<20130105 & id=["b","c"]')
    Out[27]: 
              C                date id
    1  0.451346 2013-01-02 00:00:00  b
    2  0.479483 2013-01-03 00:00:00  c
    
    [2 rows x 3 columns]
    

    Here is a the Table node of ptdump -av of the file:

    /df/table (Table(10,)) ''
      description := {
      "index": Int64Col(shape=(), dflt=0, pos=0),
      "C": Float64Col(shape=(), dflt=0.0, pos=1),
      "date": Int64Col(shape=(), dflt=0, pos=2),
      "id": StringCol(itemsize=1, shape=(), dflt='', pos=3)}
      byteorder := 'little'
      chunkshape := (2621,)
      autoindex := True
      colindexes := {
        "date": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "C": Index(6, medium, shuffle, zlib(1)).is_csi=False,
        "id": Index(6, medium, shuffle, zlib(1)).is_csi=False}
      /df/table._v_attrs (AttributeSet), 19 attributes:
       [CLASS := 'TABLE',
        C_dtype := 'float64',
        C_kind := ['C'],
        FIELD_0_FILL := 0,
        FIELD_0_NAME := 'index',
        FIELD_1_FILL := 0.0,
        FIELD_1_NAME := 'C',
        FIELD_2_FILL := 0,
        FIELD_2_NAME := 'date',
        FIELD_3_FILL := '',
        FIELD_3_NAME := 'id',
        NROWS := 10,
        TITLE := '',
        VERSION := '2.7',
        date_dtype := 'datetime64',
        date_kind := ['date'],
        id_dtype := 'string8',
        id_kind := ['id'],
        index_kind := 'integer']
    

    The key thing to note is that the data_columns are separate in the 'description', AND they are setup as indexes.