I process large (>1 GB) CSV files with pandas. The script should detect whether the data in the dataframe differs from the data in the previous run, but I can't store the past dataframe itself. I'm looking for a fast function that returns a kind of hash value for a pandas dataframe, so that I only need to store and compare those "hash-like" values.
This should work:

import joblib
joblib.hash(df)

joblib.hash computes a deterministic hash of arbitrary Python objects, with special support for numpy data, so it handles dataframes.
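To detect changes between runs you only need to persist the hash string, not the dataframe. A minimal sketch of that idea; the helper name, the hash file path and the CSV path are illustrative assumptions, not something from the question:

import joblib
import pandas as pd

def data_changed(df, hash_file="last_hash.txt"):
    # Compute a deterministic hash string of the dataframe's contents
    current = joblib.hash(df)
    previous = None
    try:
        with open(hash_file) as f:
            previous = f.read().strip()
    except FileNotFoundError:
        pass  # first run: no stored hash yet
    # Store the current hash for the next run
    with open(hash_file, "w") as f:
        f.write(current)
    return current != previous

df = pd.read_csv("big_file.csv")
if data_changed(df):
    print("data differs from the previous run")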
Also, this undocumented hash function exists in pandas 0.20.1:
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
hash_pandas_object returns one uint64 hash per row (a Series the same length as the dataframe), so just call .sum() on the result if you want a single overall value rather than per-row hashes.
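For example, comparing a single digest between runs could look like the sketch below; reading the CSV and persisting the digest to a text file are illustrative assumptions:

import pandas as pd
from pandas.util import hash_pandas_object

df = pd.read_csv("big_file.csv")
row_hashes = hash_pandas_object(df)   # Series with one uint64 per row
digest = row_hashes.sum()             # collapse to a single overall value

# Compare with the digest stored by the previous run, then overwrite it
try:
    with open("last_digest.txt") as f:
        changed = str(digest) != f.read().strip()
except FileNotFoundError:
    changed = True  # first run: nothing to compare against
with open("last_digest.txt", "w") as f:
    f.write(str(digest))

print("changed:", changed)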