Tags: python, pandas, hdfstore, blaze

Maintain data columns when converting pandas hdfstore with odo


I'm using odo from the blaze project to merge multiple pandas HDFStore tables, following the suggestion in this question: Concatenate two big pandas.HDFStore HDF5 files

The stores have identical columns and, by design, non-overlapping indices, with a few million rows each. The individual files may fit into memory, but the total combined file probably will not.

Is there a way I can preserve the settings the HDFStore was created with? I lose the data columns and compression settings.

I tried odo(part, whole, datacolumns=['col1','col2']) without luck.

Failing that, any suggestions for alternative methods would be appreciated. I could of course do this manually, but then I would have to manage the chunk sizing myself to avoid running out of memory.


Solution

  • odo doesn't support propagation of compression or data_columns at the moment. Both are pretty easy to add; I created an issue here

    In the meantime, you can do this directly in pandas:

    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: df1 = pd.DataFrame({'A': np.arange(5), 'B': np.random.randn(5)})

    In [4]: df2 = pd.DataFrame({'A': np.arange(5) + 10, 'B': np.random.randn(5)})

    In [5]: df1.to_hdf('test1.h5', 'df', mode='w', format='table', data_columns=['A'])

    In [6]: df2.to_hdf('test2.h5', 'df', mode='w', format='table', data_columns=['A'])
    

    Iterate over the input files, reading and writing in chunks to the final store. Note that you have to specify the data_columns here as well, and pass append=True so each chunk is added rather than replacing the previous one.

    In [7]: for f in ['test1.h5', 'test2.h5']:
       ...:     for df in pd.read_hdf(f, 'df', chunksize=2):
       ...:         df.to_hdf('test3.h5', 'df', format='table', append=True, data_columns=['A'])
       ...:
    
    In [8]: with pd.HDFStore('test3.h5') as store:
       ...:     print(store)
       ...:
    <class 'pandas.io.pytables.HDFStore'>
    File path: test3.h5
    /df            frame_table  (typ->appendable,nrows->10,ncols->2,indexers->[index],dc->[A])
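
    Odo won't carry over compression either, but you can apply it yourself during the same chunked rewrite, since to_hdf accepts the standard complevel/complib keywords. A minimal sketch, assuming blosc compression at level 9 (an example choice, not something the question specifies) and a hypothetical target file test4.h5:

    In [9]: for f in ['test1.h5', 'test2.h5']:
       ...:     for df in pd.read_hdf(f, 'df', chunksize=2):
       ...:         # complevel/complib compress the combined store as it is built,
       ...:         # while data_columns and append=True work exactly as above
       ...:         df.to_hdf('test4.h5', 'df', format='table', append=True,
       ...:                   data_columns=['A'], complevel=9, complib='blosc')
       ...: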