apache-spark-sql, data-partitioning

Split string values into equal partitions, deterministically


I need to split my data into 80 partitions regardless of what the key is, and the same data should return the same partition value every time. Is there an algorithm that can be used to implement this? The key is a combination of multiple fields.

I am planning to generate a surrogate key for the key combination and apply a range function using min and max values to split the data into the desired number of partitions. But if the same key arrives tomorrow, I have to look back to find its surrogate key so that the same keys fall on the same partition.

Is there an existing algorithm/formula/PySpark function to which I can pass a string value and get back the same number every time, while also distributing the string values evenly? This is what I have tried so far:

# current attempt: note that Spark's hash() can return negative values,
# so "part" here ranges over -19..19 rather than 0..19
df_1 = spark.sql("select column_1, column_2, hash(column_1) % 20 as part from temptable")
df_1.createOrReplaceTempView("test")
spark.sql("select part, count(*) from test group by part").show(160, False)

Solution

  • If you can't use a numeric key and just take a modulus, then...

    Hash the string key to a number and take it mod 80. One caveat: Python's built-in hash() is salted per process for strings (unless PYTHONHASHSEED is pinned), so it is only stable within a single run; for a value that stays the same across runs, use a deterministic digest such as hashlib.md5, or Spark's own hash() with pmod(). Either way the keys sort neatly into 80 buckets (numbered 0 - 79).

    e.g. something like this:

    bucket = abs(hash(key_string) % 80)  # stable only within one process unless PYTHONHASHSEED is fixed
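
    A minimal sketch of a deterministic alternative using hashlib (the bucket_for name and the sample key are made up for illustration):

    import hashlib

    def bucket_for(key_string: str, num_buckets: int = 80) -> int:
        """Map a key string to a stable bucket in [0, num_buckets)."""
        # md5 is deterministic across processes and machines, and its output
        # is close to uniform, so keys spread evenly over the buckets
        digest = hashlib.md5(key_string.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_buckets

    print(bucket_for("field_1|field_2"))  # same input -> same bucket on every run

    On a DataFrame this could be wrapped in a udf, though the built-in Spark hash()/pmod() route avoids the Python serialization overhead.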