Search code examples
pythonpandashdfshdf5hdfstore

Pandas HDFStore : Omitting Duplicates


I have an HDFStore where i enter data every night. I am wondering if system crash etc, i might rerun processes so I want to make sure that if a row already exists that pandas doesnt include this the next time the process is run. Is there a way to look for duplicates and not include them?


Solution

  • If you have a unique index in your HDFStore you can use the following approach:

    create sample DF:

    In [34]: df = pd.DataFrame(np.random.rand(5,3), columns=list('abc'))
    
    In [35]: df
    Out[35]:
              a         b         c
    0  0.407144  0.972121  0.462502
    1  0.044768  0.165924  0.852705
    2  0.703686  0.156382  0.066925
    3  0.912794  0.362916  0.866779
    4  0.156249  0.625272  0.360799
    

    save it to HDFStore:

    In [36]: store = pd.HDFStore(r'd:/temp/t.h5')
    
    In [37]: store.append('test', df, format='t')
    

    Add a new row to our DF:

    In [38]: df.loc[len(df)] = [-1, -1, -1]
    
    In [39]: df
    Out[39]:
              a         b         c
    0  0.407144  0.972121  0.462502
    1  0.044768  0.165924  0.852705
    2  0.703686  0.156382  0.066925
    3  0.912794  0.362916  0.866779
    4  0.156249  0.625272  0.360799
    5 -1.000000 -1.000000 -1.000000   # new row, which is NOT in the HDF file
    

    select indices of duplicated rows:

    In [40]: idx = store.select('test', where="index in df.index", columns=['index']).index
    

    check:

    In [41]: df.query("index not in @idx")
    Out[41]:
         a    b    c
    5 -1.0 -1.0 -1.0
    

    append to HDFStore only those rows, which weren't yet saved:

    In [42]: store.append('test', df.query("index not in @idx"), format='t')
    

    check:

    In [43]: store.select('test')
    Out[43]:
              a         b         c
    0  0.407144  0.972121  0.462502
    1  0.044768  0.165924  0.852705
    2  0.703686  0.156382  0.066925
    3  0.912794  0.362916  0.866779
    4  0.156249  0.625272  0.360799
    5 -1.000000 -1.000000 -1.000000   # new row has been added