Search code examples
pythonpandashdfstore

Pandas HDFStore of MultiIndex DataFrames: how to efficiently get all indexes


In Pandas, is there a way to efficiently pull out all the MultiIndex indexes present in an HDFStore in table format?

I can select() efficiently using where=, but I want all indexes, and none of the columns. I can also select() using iterator=True to save RAM, but that still means reading pretty much all the table from disk, so it's still slow.

I have been hunting in the store.root..table.* stuff, hoping that I can get a list of index values. Am I on the right track?

Plan B would be to keep a shorter MultiIndex DataFrame that just contains empty DataFrames appended every time I append the main one. I can retrieve that and get the index much more cheaply than the main one. Inelegant though.


Solution

  • Create a multi-index df

    In [35]: df = DataFrame(randn(100000,3),columns=list('ABC'))
    
    In [36]: df['one'] = 'foo'
    
    In [37]: df['two'] = 'bar'
    
    In [38]: df.ix[50000:,'two'] = 'bah'
    
    In [40]: mi = df.set_index(['one','two'])
    
    In [41]: mi
    Out[41]: 
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
    Data columns (total 3 columns):
    A    100000  non-null values
    B    100000  non-null values
    C    100000  non-null values
    dtypes: float64(3)
    

    Store it as a table

    In [42]: store = pd.HDFStore('test.h5',mode='w')
    
    In [43]: store.append('df',mi)
    

    get_storer will return the stored object (but not retrieve the data)

    In [44]: store.get_storer('df').levels
    Out[44]: ['one', 'two']
    
    In [2]: store
    Out[2]: 
    <class 'pandas.io.pytables.HDFStore'>
    File path: test.h5
    /df            frame_table  (typ->appendable_multi,nrows->100000,ncols->5,indexers->[index],dc->[two,one])
    

    The index levels are created as data_columns, meaning you can use them in selections This is how to select only the index

    In [48]: store.select('df',columns=['one'])
    Out[48]: 
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
    Empty DataFrame
    

    To select a single column and return it as a mi-frame

    In [49]: store.select('df',columns=['A'])
    Out[49]: 
    <class 'pandas.core.frame.DataFrame'>
    MultiIndex: 100000 entries, (foo, bar) to (foo, bah)
    Data columns (total 1 columns):
    A    100000  non-null values
    dtypes: float64(1)
    

    To select a single column as a Series (which can also be the index as these are stored as columns). This will be quite fast.

    In [2]: store.select_column('df','one')
    Out[2]: 
    0     foo
    1     foo
    2     foo
    3     foo
    4     foo
    5     foo
    6     foo
    7     foo
    8     foo
    9     foo
    10    foo
    11    foo
    12    foo
    13    foo
    14    foo
    ...
    99985    foo
    99986    foo
    99987    foo
    99988    foo
    99989    foo
    99990    foo
    99991    foo
    99992    foo
    99993    foo
    99994    foo
    99995    foo
    99996    foo
    99997    foo
    99998    foo
    99999    foo
    Length: 100000, dtype: object
    

    If you really want the fastest selection of only the index

    In [4]: %timeit store.select_column('df','one')
    100 loops, best of 3: 8.71 ms per loop
    
    In [5]: %timeit store.select('df',columns=['one'])
    10 loops, best of 3: 43 ms per loop
    

    Or to get a complete index

    In [6]: def f():
       ...:     level_1 =  store.select_column('df','one')
       ...:     level_2 =  store.select_column('df','two')
       ...:     return MultiIndex.from_arrays([ level_1, level_2 ])
       ...: 
    
    In [17]: %timeit f()
    10 loops, best of 3: 28.1 ms per loop
    

    If you want the values for each levels, a pretty fast way of doing it

    In [2]: store.select_column('df','one').unique()
    Out[2]: array(['foo'], dtype=object)
    
    In [3]: store.select_column('df','two').unique()
    Out[3]: array(['bar', 'bah'], dtype=object)