Search code examples
pysparkazure-databricks

Autogenerated and unique id of type bigint in Azure databricks pyspark


I want to create an autogenerated id and it has to be unique across the job and of type Integer. If I use monotonically_increasing_id it generates unique id but only for the particular job run, when the next batch is running, it collides with the id.

I've couple of columns like batchid which is datetime stamp and if I concat 'monotonically_increasing_id' it could work but due to spark parallelism and bigint range could be an issue.

I have also tried window partition with runid (job id) as well, but it is not giving me the unique value.

I'm listing all the options I have tried, any help could be useful.

df = spark.table("table")
windowSpec = Window.partitionBy("runid", "batchid").orderBy("batchid")

df.withColumn("id", (hash(concat(col("id"), lit("_"), col("batchid"), lit("_"), col("runid"))) % 10000000000).cast("bigint"))


df.withColumn("id", (abs(hash(concat(col("batchid"), lit("_"), monotonically_increasing_id(), lit("_")))) % 10000000000).cast("bigint"))

df = df.withColumn(
"Id",
(hash(concat(
    format_string("%010d", unix_timestamp(current_timestamp()) * 1000),  # Timestamp in milliseconds
    format_string("%05d", monotonically_increasing_id() % 100000),       # Ensuring uniqueness for the same millisecond, restrict to 5 digits
    concat_ws("_", *unique_cols)                                        # Concatenate the unique columns
)) + lit(2**31)).cast("bigint"))

df = df.withColumn("unique_id", 
               (monotonically_increasing_id() * 1000000000) +  
               (row_number().over(windowSpec) * 100000) +       
               col("runid").cast("bigint") +                    
               col("batchid").cast("bigint")) 

Solution

  • I tried this, and appears to be working good.

    scale = 100000000000000 #10^14
    df.withColumn("id",(monotonically_increasing_id() * scale +col("batch").cast("bigint")).cast("bigint"),)
    

    This is working and dont find the collision. Batch is a timestamp.