I want to create ID columns using hash functions in my pandas DataFrames. The pipeline will be reprocessed over time, and I need the hashes to be stable across different pandas versions and environments. I am using a composite key consisting of multiple columns to generate these hashes.
I am currently using pd.util.hash_pandas_object for its speed, but I couldn't find anything in the documentation about its stability over time. Is pd.util.hash_pandas_object stable across different versions of pandas? If not, could you suggest a fast and stable alternative for hashing composite keys in DataFrames?
It has been stable so far, although that is an empirical observation across versions rather than a documented guarantee.
Assuming this example:
import pandas as pd
print(pd.__version__)
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': ['A', 'B', 'C']})
pd.util.hash_pandas_object(df)
Output:
# pandas 1.0.3
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
# pandas 1.4.3
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
# pandas 2.2.2
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
Note however that the function is sensitive to the dtype:
# conversion to the nullable extension dtypes (Int64, string) leaves the hashes unchanged
pd.util.hash_pandas_object(df.convert_dtypes())
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
# but upcasting int to float changes them
pd.util.hash_pandas_object(df.astype({'col1': float}))
0 3633373482604536162
1 12198058518291636952
2 7562945033953410876
dtype: uint64
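If you want a guarantee rather than an empirical observation, you can sidestep pandas' internal hashing entirely: serialize the composite key to strings and feed it to a fixed algorithm such as SHA-256 from the standard library, whose output is specified and will never change between environments. This is a sketch, not the only way to do it; the '|' separator and the column names are illustrative assumptions, and you should pick a separator that cannot occur inside your key values.

```python
import hashlib

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': ['A', 'B', 'C']})

# Normalize every key column to a string so later dtype changes
# (e.g. an int -> float upcast) cannot alter the hash input
key = df['col1'].astype(str) + '|' + df['col2'].astype(str)

# SHA-256 output is fixed by specification, unlike a library's
# internal hashing scheme, so these IDs survive pandas upgrades
df['id'] = [hashlib.sha256(k.encode('utf-8')).hexdigest() for k in key]
```

This is slower than pd.util.hash_pandas_object, so it is a trade of speed for a hard stability guarantee; for moderate table sizes the difference is usually acceptable.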