python · pyspark · hash · hex

pyspark - Issue in converting hex to decimal


I am facing an issue converting hex to decimal in PySpark (approach learned from here).

from pyspark.sql.functions import col, sha2, conv, substring

# User data with ZIPs
user_data = [
    ("100052441000101", "21001"),
    ("100052441000102", "21002"),
    ("100052441000103", "21002"),
    ("user1", "21001"),
    ("user2", "21002")
]

df_users = spark.createDataFrame(user_data, ["user_id", "ZIP"])

# Generate SHA-256 hash from the user_id
df_users = df_users.withColumn("hash_key", sha2(col("user_id"), 256))

# Convert the hexadecimal hash (sha2 output) to decimal
df_users = df_users.withColumn("hash_substr1", substring(col('hash_key'), 1, 16))
df_users = df_users.withColumn("hash_substr2", substring(col('hash_key'), 1, 15))
df_users = df_users.withColumn("hash_int1", conv(col('hash_substr1'), 16, 10).cast("bigint"))
df_users = df_users.withColumn("hash_int2", conv(col('hash_substr2'), 16, 10).cast("bigint"))

df_users.show()

The output I get is:

+---------------+-----+--------------------+----------------+---------------+-------------------+-------------------+
|        user_id|  ZIP|            hash_key|    hash_substr1|   hash_substr2|          hash_int1|          hash_int2|
+---------------+-----+--------------------+----------------+---------------+-------------------+-------------------+
|100052441000101|21001|3cf4b90397964f6b2...|3cf4b90397964f6b|3cf4b90397964f6|4392338961672327019| 274521185104520438|
|100052441000102|21002|e18aec7bb2a60b62d...|e18aec7bb2a60b62|e18aec7bb2a60b6|               null|1015753888833888438|
|100052441000103|21002|e55127f9f61bbe433...|e55127f9f61bbe43|e55127f9f61bbe4|               null|1032752028895525860|
|          user1|21001|0a041b9462caa4a31...|0a041b9462caa4a3|0a041b9462caa4a| 721732164412679331|  45108260275792458|
|          user2|21002|6025d18fe48abd451...|6025d18fe48abd45|6025d18fe48abd4|6928174017724202309| 433010876107762644|
+---------------+-----+--------------------+----------------+---------------+-------------------+-------------------+

Note that hash_int1 is null for the 2nd and 3rd records. However, when I convert the corresponding hex strings using plain Python, I do get values:

hexes = ["e18aec7bb2a60b62", "e18aec7bb2a60b6", "e55127f9f61bbe43", "e55127f9f61bbe4"]
[int(h, 16) for h in hexes]

[16252062221342215010, 1015753888833888438, 16524032462328413763, 1032752028895525860]

The values match wherever Spark did not return null.
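The pattern of nulls lines up with the signed 64-bit limit. A quick plain-Python check (using only the hex strings from the output above) shows that the two rows that went null exceed 2**63 - 1, while any 15-hex-digit value always fits:

```python
# Signed 64-bit upper bound (Spark's LongType / bigint)
LONG_MAX = 2**63 - 1  # 9223372036854775807

# The two 16-digit substrings that came back null overflow LongType...
for h in ["e18aec7bb2a60b62", "e55127f9f61bbe43"]:
    assert int(h, 16) > LONG_MAX

# ...while 15 hex digits top out at 16**15 - 1 = 2**60 - 1, which always
# fits — which is why hash_int2 (built from 15 digits) is never null
assert int("f" * 15, 16) <= LONG_MAX
```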

The final objective is to generate replicable (deterministic) random values:

df_users = df_users.withColumn("random_value", (col("hash_int1") % 10**12) / 10**12)

Solution

  • Spark's LongType range is -9223372036854775808 to 9223372036854775807 (a signed 64-bit integer).

    However, your value 16524032462328413763 (the full 16-hex-digit substring e55127f9f61bbe43) is outside this range, so it cannot be stored as a LongType, and the cast returns null. If you remove .cast("bigint"), you can see that your values won't be null and are correct — conv returns them as strings.
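One way to keep the pipeline null-free is to convert only 15 hex digits, as the hash_substr2 column in the question already does, since 15 hex digits can never exceed a bigint. A plain-Python sketch of that end-to-end logic (hashlib.sha256 produces the same hex digest as Spark's sha2(col, 256); the helper name replicable_random is made up for illustration):

```python
import hashlib

def replicable_random(user_id: str) -> float:
    """Deterministic pseudo-random value in [0, 1) derived from user_id."""
    # Same hex digest Spark's sha2(col("user_id"), 256) would produce
    hash_key = hashlib.sha256(user_id.encode()).hexdigest()
    # 15 hex digits max out at 16**15 - 1 = 2**60 - 1, safely inside
    # LongType's 2**63 - 1, so a bigint cast in Spark would never go null
    hash_int = int(hash_key[:15], 16)
    return (hash_int % 10**12) / 10**12

# Matches the hash_key shown for user1 in the question's output
assert hashlib.sha256(b"user1").hexdigest().startswith("0a041b9462caa4a3")

# Deterministic and bounded: same input, same value, always in [0, 1)
assert replicable_random("user1") == replicable_random("user1")
assert 0 <= replicable_random("100052441000102") < 1
```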