How to detect that a large pandas DataFrame has different data than before

I process large (>1 GB) CSV files with pandas. The script should detect whether the data in the DataFrame differs from the data in the previous run. I can't store the previous DataFrame, so I'm looking for a fast function that returns a kind of hash value for a pandas DataFrame. That way I could store and compare only those "hash-like" values.

Solution

import joblib
joblib.hash(df)

This should work: joblib.hash returns a single hash string for the whole object.

pandas 0.20.1 also ships a hash function, which was undocumented at the time:

from pandas.util import hash_pandas_object
h = hash_pandas_object(df)

hash_pandas_object returns a Series with one hash per row, so call .sum() on the result if you want a single overall value. Note that a sum does not change when rows are reordered.
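As a rough sketch of how the stored value could be built and compared between runs: the helper name dataframe_fingerprint and the sample frames below are made up for illustration, and feeding the row hashes into SHA-256 (instead of summing them) is my own choice so that row order also affects the result. The column labels are hashed separately, on the assumption that hash_pandas_object only looks at the values and the index.

```python
import hashlib

import pandas as pd
from pandas.util import hash_pandas_object


def dataframe_fingerprint(df: pd.DataFrame) -> str:
    """Return a hex digest for the DataFrame (hypothetical helper)."""
    hasher = hashlib.sha256()
    # Include the column labels explicitly (assumption: they are not
    # part of the per-row hashes).
    hasher.update(repr(list(df.columns)).encode("utf-8"))
    # One uint64 hash per row; index=True folds the index into each row hash.
    row_hashes = hash_pandas_object(df, index=True)
    # Hash the raw bytes of the row hashes, so row order matters too.
    hasher.update(row_hashes.values.tobytes())
    return hasher.hexdigest()


# Small illustrative frames standing in for the large CSV data.
df_old = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
df_same = df_old.copy()
df_new = df_old.copy()
df_new.loc[1, "a"] = 99  # change a single cell

print(dataframe_fingerprint(df_old) == dataframe_fingerprint(df_same))  # True
print(dataframe_fingerprint(df_old) == dataframe_fingerprint(df_new))   # False
```

Only the short hex string needs to be saved between runs. A plain .sum() is cheaper and would also work if row order is irrelevant.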
