I am facing an issue converting hex to decimal in PySpark (approach learned from here).
from pyspark.sql.functions import col, sha2, conv, substring
# User data with ZIPs
user_data = [
("100052441000101", "21001"),
("100052441000102", "21002"),
("100052441000103", "21002"),
("user1", "21001"),
("user2", "21002")
]
df_users = spark.createDataFrame(user_data, ["user_id", "ZIP"])
# Generate SHA-256 hash from the user_id
df_users = df_users.withColumn("hash_key", sha2(col("user_id"), 256))
# Convert the hexadecimal hash (sha2 output) to decimal
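# Take the first 16 and first 15 hex characters of the 64-character digest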
df_users = df_users.withColumn("hash_substr1", substring(col('hash_key'), 1, 16))
df_users = df_users.withColumn("hash_substr2", substring(col('hash_key'), 1, 15))
df_users = df_users.withColumn("hash_int1", conv(col('hash_substr1'), 16, 10).cast("bigint"))
df_users = df_users.withColumn("hash_int2", conv(col('hash_substr2'), 16, 10).cast("bigint"))
df_users.show()
The output I get is:
+---------------+-----+--------------------+----------------+---------------+-------------------+-------------------+
| user_id| ZIP| hash_key| hash_substr1| hash_substr2| hash_int1| hash_int2|
+---------------+-----+--------------------+----------------+---------------+-------------------+-------------------+
|100052441000101|21001|3cf4b90397964f6b2...|3cf4b90397964f6b|3cf4b90397964f6|4392338961672327019| 274521185104520438|
|100052441000102|21002|e18aec7bb2a60b62d...|e18aec7bb2a60b62|e18aec7bb2a60b6| null|1015753888833888438|
|100052441000103|21002|e55127f9f61bbe433...|e55127f9f61bbe43|e55127f9f61bbe4| null|1032752028895525860|
| user1|21001|0a041b9462caa4a31...|0a041b9462caa4a3|0a041b9462caa4a| 721732164412679331| 45108260275792458|
| user2|21002|6025d18fe48abd451...|6025d18fe48abd45|6025d18fe48abd4|6928174017724202309| 433010876107762644|
+---------------+-----+--------------------+----------------+---------------+-------------------+-------------------+
Note that hash_int1 is null for the 2nd and 3rd records.
However, when I convert the same hex strings in plain Python, I do get values:
hexes = ["e18aec7bb2a60b62", "e18aec7bb2a60b6", "e55127f9f61bbe43", "e55127f9f61bbe4"]
[int(h, 16) for h in hexes]
[16252062221342215010, 1015753888833888438, 16524032462328413763, 1032752028895525860]
The values are the same wherever they are not null.
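For scale, here is a quick check of these values against the signed 64-bit maximum (a small diagnostic sketch; max_int64 is my own variable name):
max_int64 = 2**63 - 1  # 9223372036854775807
for h in ["3cf4b90397964f6b", "e18aec7bb2a60b62", "e55127f9f61bbe43"]:
    v = int(h, 16)
    print(h, v, v > max_int64)
3cf4b90397964f6b 4392338961672327019 False
e18aec7bb2a60b62 16252062221342215010 True
e55127f9f61bbe43 16524032462328413763 True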
The final objective is to generate replicable random values in [0, 1):
df_users = df_users.withColumn("random_value", (col("hash_int1") % 10**12) / 10**12)
Spark's LongType range is -9223372036854775808 to 9223372036854775807. However, your value 16524032462328413763 is outside of this range, so it cannot be stored as a LongType. Note that conv returns its result as a string; casting a string that exceeds the bigint range yields null. If you remove .cast("bigint"), you can see that your values won't be null and are correct (as strings).
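If you still need a numeric column for the 16-character substring, one option (a sketch, reusing the columns above) is to cast the conv output to a wide decimal instead of bigint; the modulo expression also works on decimals:
from pyspark.sql.functions import col, conv
# decimal(38,0) is wide enough for any unsigned 64-bit value,
# so the 16-hex-digit conversion no longer overflows to null
df_users = df_users.withColumn(
    "hash_int1", conv(col("hash_substr1"), 16, 10).cast("decimal(38,0)")
)
df_users = df_users.withColumn(
    "random_value", (col("hash_int1") % 10**12) / 10**12
)
Alternatively, your 15-character substring already fits: 15 hex digits top out at 16**15 - 1 = 2**60 - 1, which is below the LongType maximum, so hash_int2 is never null.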