I want to create ID columns using hash functions in my pandas DataFrames. The pipeline will be reprocessed over time, and I need the hashes to be stable across different pandas versions and environments. I am using a composite key consisting of multiple columns to generate these hashes.
I am currently using pd.util.hash_pandas_object for its speed, but I couldn't find anything in the documentation about its stability over time. Is pd.util.hash_pandas_object stable across different versions of pandas? If not, could you suggest a fast and stable alternative for hashing composite keys in DataFrames?
It has been stable so far, although that is an empirical observation across versions rather than a documented guarantee.
Assuming this example:
import pandas as pd
print(pd.__version__)
df = pd.DataFrame({'col1': [0, 1, 2], 'col2': ['A', 'B', 'C']})
pd.util.hash_pandas_object(df)
Output:
# pandas 1.0.3
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
# pandas 1.4.3
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
# pandas 2.2.2
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
Note however that the function is sensitive to the dtype:
# conversion to the nullable extension dtypes (Int64, string) leaves the hashes unchanged
pd.util.hash_pandas_object(df.convert_dtypes())
0 3633373482604536162
1 5258867552551810711
2 13022556061186435711
dtype: uint64
# but upcasting int to float changes them
pd.util.hash_pandas_object(df.astype({'col1': float}))
0 3633373482604536162
1 12198058518291636952
2 7562945033953410876
dtype: uint64
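If you want a guarantee rather than an empirical observation, you can sidestep pandas' internal hashing entirely: serialize the composite key to strings and feed it to a fixed algorithm such as SHA-256 from the standard library, whose output is specified and will never change between environments. This is a sketch, not the only way to do it; the '|' separator and the column names are illustrative assumptions, and you should pick a separator that cannot occur inside your key values.

```python
import hashlib

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2], 'col2': ['A', 'B', 'C']})

# Normalize every key column to a string so later dtype changes
# (e.g. an int -> float upcast) cannot alter the hash input
key = df['col1'].astype(str) + '|' + df['col2'].astype(str)

# SHA-256 output is fixed by specification, unlike a library's
# internal hashing scheme, so these IDs survive pandas upgrades
df['id'] = [hashlib.sha256(k.encode('utf-8')).hexdigest() for k in key]
```

This is slower than pd.util.hash_pandas_object, so it is a trade of speed for a hard stability guarantee; for moderate table sizes the difference is usually acceptable.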