Search code examples
pythonpandashdf5

Will Pandas auto sort new data appended to a dataframe (in hdf5 format)


For example, I got a df1, index from 20200101 to 20210101, dtype is DateTime, and I already save this df to h5 (format is table). Then I got a new df2, index from 20200101 to 20200201, which is in the middle of the index of df1. If I append df2 to this h5, will it auto sort index?


Solution

  • It's easy enough to write an example to demonstrate the behavior and the solution.

    Code below creates 2 dataframes of odd and even numbers and writes them to an HDF5 file in Table format as key='test_data'. (Table format required to append data.) When you run that, you will see that the odd numbers are first, followed by the even numbers (NOT auto-sorted).

    col2 = [ x for x in range(1,53,2)]
    col3 = [ float(x) for x in range(1,53,2)]
    
    df1 = pd.DataFrame({'col_int': col2, 'col_float': col3})
    df1.to_hdf('SO_70731279.h5',key='test_data',mode='w', format='table',
               data_columns=True)
    
    col2 = [ x for x in range(2,54,2)]
    col3 = [ float(x) for x in range(2,54,2)]
    
    df2 = pd.DataFrame({'col_int': col2, 'col_float': col3})
    df2.to_hdf('SO_70731279.h5',key='test_data',mode='a', format='table',
               data_columns=True, append=True)
    

    This code reads the data from key='test_data' to df3, sorts the dataframe in-place, then writes to the HDF5 file with key='sorted_data'. Run this and you will see this data in the HDF5 file is in ascending order.

    df3 = pd.read_hdf('SO_70731279.h5',key='test_data',columns=['col_int','col_float'])
    df3.sort_values('col_int',inplace=True,ignore_index=True)
    df3.to_hdf('SO_70731279.h5',key='sorted_data',mode='a', format='table',
               data_columns=True)
    

    If you want to REPLACE the data in key='test_data', you have to write df3 to the key and not include append=True. This will overwrite the key.

    df3.to_hdf('SO_70731279.h5',key='test_data',mode='a', format='table',
               data_columns=True)  ## Note append=True is missing on this call
    

    This solution will solve your problem. However, computational efficiency with very large dataframes is unknown. You should test with your data before implementing. Good luck.