python pandas hdf5 hdfstore

Storing multiple objects in an HDFStore group


I want to store multiple objects in an HDFStore, but I want to organize them into groups. Something along the lines of:

import pandas as pd
my_store = pd.HDFStore('my_local_store.h5')
my_store._handle.createGroup('/', 'data_source_1') # this works, but I'm not sure what it does
my_store['/data_source_1']['part-1'] = pd.DataFrame({'b':[1,2,9,2,3,5,2,5]}) # this does not work
my_store['/data_source_1']['part-2'] = pd.DataFrame({'b':[3,8,4,2,5,5,6,1]}) # this does not work either

Solution

  • Try using the full path as the key:

    my_store['/data_source_1/part-1'] = ...
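
As a minimal, self-contained sketch of the same idea, the two frames from the question can be written under one group and read back. It assumes PyTables is installed; the temporary file location and the names part_1/part_2 are choices made for this illustration (underscores are used instead of hyphens because PyTables may emit a NaturalNameWarning for names that are not valid Python identifiers).

```python
import os
import tempfile

import pandas as pd

# The two frames from the question.
part_1 = pd.DataFrame({'b': [1, 2, 9, 2, 3, 5, 2, 5]})
part_2 = pd.DataFrame({'b': [3, 8, 4, 2, 5, 5, 6, 1]})

# A temporary directory is used only so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_local_store.h5')
    with pd.HDFStore(path, mode='w') as my_store:
        # The key is the full path; no explicit group creation is needed.
        my_store['/data_source_1/part_1'] = part_1
        my_store['/data_source_1/part_2'] = part_2

        stored_keys = sorted(my_store.keys())
        round_trip = my_store['/data_source_1/part_1']

print(stored_keys)
```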
    

    demo:

    In [13]: store = pd.HDFStore('c:/temp/stocks.h5')
    
    In [15]: store['/aaa/bbb'] = df
    
    In [17]: store.groups
    Out[17]:
    <bound method HDFStore.groups of <class 'pandas.io.pytables.HDFStore'>
    File path: c:/temp/stocks.h5
    /aaa/bbb            frame        (shape->[3,7])
    /stocks             wide_table   (typ->appendable,nrows->6,ncols->3,indexers->[major_axis,minor_axis],dc->[AAPL,ABC,GOOG])>
    
    In [18]: store['/aaa/bbb2'] = df
    
    In [20]: store.items
    Out[20]:
    <bound method HDFStore.items of <class 'pandas.io.pytables.HDFStore'>
    File path: c:/temp/stocks.h5
    /aaa/bbb             frame        (shape->[3,7])
    /aaa/bbb2            frame        (shape->[3,7])
    /stocks              wide_table   (typ->appendable,nrows->6,ncols->3,indexers->[major_axis,minor_axis],dc->[AAPL,ABC,GOOG])>
    

    UPDATE:

    In [29]: store.get_node('/aaa')
    Out[29]:
    /aaa (Group) ''
      children := ['bbb' (Group), 'bbb2' (Group)]
    

    PS: AFAIK, pandas treats the key (/aaa/bbb) as a full path, so the intermediate group (/aaa) is created automatically.
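
To inspect the resulting hierarchy without reaching into PyTables nodes, something like the following may help. It is a sketch that assumes a pandas version providing HDFStore.walk() (0.24 or later, as far as I know); the file name groups_demo.h5 and the small frame are made up for the example.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'b': [1, 2, 3]})  # placeholder data for the sketch

layout = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'groups_demo.h5')
    with pd.HDFStore(path, mode='w') as store:
        store['/aaa/bbb'] = df
        store['/aaa/bbb2'] = df

        # walk() yields (group path, subgroup names, pandas object names)
        # for every group that is not itself a stored pandas object.
        for group_path, subgroups, leaves in store.walk():
            layout[group_path] = sorted(leaves)

print(layout.get('/aaa'))
```

Here /aaa shows up as a group holding the two stored frames, which matches the get_node output above.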

    UPDATE2: listing the store:

    Suppose we have the following store:

    In [19]: store
    Out[19]:
    <class 'pandas.io.pytables.HDFStore'>
    File path: D:\temp\.data\hdf\test_groups.h5
    /data_source_1/subdir1/1            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index])
    /data_source_1/subdir1/2            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index])
    /data_source_1/subdir1/3            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index])
    /data_source_1/subdir1/4            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index])
    /data_source_1/subdir1/5            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index])
    /data_source_1/subdir2/1            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/2            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/3            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/4            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/5            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/6            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/7            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/8            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    /data_source_1/subdir2/9            frame_table  (typ->appendable,nrows->10,ncols->3,indexers->[index],dc->[a,b,c])
    

    Let's find all entries in /data_source_1/subdir2:

    In [20]: [s for s in store if s.startswith('/data_source_1/subdir2/')]
    Out[20]:
    ['/data_source_1/subdir2/1',
     '/data_source_1/subdir2/2',
     '/data_source_1/subdir2/3',
     '/data_source_1/subdir2/4',
     '/data_source_1/subdir2/5',
     '/data_source_1/subdir2/6',
     '/data_source_1/subdir2/7',
     '/data_source_1/subdir2/8',
     '/data_source_1/subdir2/9']
    

    Once you have the keys, you can easily select data from each entry:

    In [25]: dfs = [store.select(s, where='a > 5') for s in store if s.startswith('/data_source_1/subdir2/')]
    
    In [26]: [len(df) for df in dfs]
    Out[26]: [5, 5, 5, 5, 5, 5, 5, 5, 5]
    
    In [29]: dfs = [store.select(s, where='a > 7') for s in store if s.startswith('/data_source_1/subdir2/')]
    
    In [30]: [len(df) for df in dfs]
    Out[30]: [4, 4, 4, 4, 4, 4, 4, 4, 4]
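
The prefix filter and the select call can be wrapped in a small helper. The sketch below is an illustration, not part of the original answer: select_under is a hypothetical name, the data is deterministic (a runs from 0 to 9) rather than the random data used above, and only three entries per subgroup are written. Querying with where requires the table format with a listed in data_columns. PyTables may warn about purely numeric node names such as 1; the writes still succeed.

```python
import os
import tempfile

import pandas as pd


def select_under(store, prefix, where=None):
    """Return {key: DataFrame} for every key in the store starting with prefix."""
    return {
        key: store.select(key, where=where)
        for key in store.keys()
        if key.startswith(prefix)
    }


# Deterministic stand-in for the frames in the store listed above.
frame = pd.DataFrame({'a': range(10), 'b': range(10, 20), 'c': range(20, 30)})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test_groups.h5')
    with pd.HDFStore(path, mode='w') as store:
        for i in range(1, 4):
            # subdir1: table format without data columns (not queryable on a).
            store.put(f'/data_source_1/subdir1/{i}', frame, format='table')
            # subdir2: data columns make a, b and c usable in where clauses.
            store.put(f'/data_source_1/subdir2/{i}', frame,
                      format='table', data_columns=['a', 'b', 'c'])

        selected = select_under(store, '/data_source_1/subdir2/', where='a > 5')

row_counts = {key: len(df) for key, df in selected.items()}
print(row_counts)
```

Keeping the trailing slash in the prefix avoids accidentally matching a sibling group whose name merely starts with subdir2.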