python, pandas, large-data

How to detect that a large pandas DataFrame has different data than before


I process large (>1 GB) CSV files with pandas. The script should detect whether the data in the DataFrame differs from the data in the previous run. I can't store the past DataFrame, so I'm looking for a fast function that returns some kind of hash value for a pandas DataFrame. That way I could store and compare only those "hash-like" values.


Solution

  • import joblib
    joblib.hash(df)
    

    This should work: joblib.hash returns a hex digest string that you can store and compare between runs.

    Also, this undocumented hash function exists in pandas 0.20.1:

    from pandas.util import hash_pandas_object
    h = hash_pandas_object(df)
    

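    hash_pandas_object returns a Series with one uint64 hash per row, not a single value. Call .sum() on it if you want one overall value. Note that a sum does not depend on the order of the hashes it adds up.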
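
    If row order should also count as a change, one option is to hash the bytes of the per-row hashes instead of summing them. The following is only a minimal sketch: the function name dataframe_fingerprint is made up for illustration, and the choice of SHA-256 and of mixing in the column names are assumptions, not something pandas prescribes.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def dataframe_fingerprint(df: pd.DataFrame) -> str:
    """Return a hex digest for df (hypothetical helper, not a pandas API)."""
    # One uint64 hash per row; index=True includes the index labels.
    row_hashes = hash_pandas_object(df, index=True)
    # Hash the raw bytes of the row hashes, so the order of rows matters.
    digest = hashlib.sha256(row_hashes.values.tobytes())
    # Fold in the column names as well, so a renamed column changes the digest.
    digest.update("|".join(map(str, df.columns)).encode("utf-8"))
    return digest.hexdigest()


if __name__ == "__main__":
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    # Store this string after each run and compare it on the next one.
    print(dataframe_fingerprint(df))
```

    The stored value is a 64-character string, so it can go into a small text file or a database column and be compared with == on the next run.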