Search code examples
pysparkrow-number

pyspark row number reset for every 4 digits completion


I am applying row_number on particular column based on one column partition. how to reset the row number to starting number 1 once it reaches 999 through pyspark?

partcode row number
12345 0001
12345 0002
12345 0003
12345 9999 -- Reaches max
12345 0001
12345 0002

Solution

  • After generating row number:

    1. Subtract 1 from row number to get row number starting from zero
    2. Apply modulo operation using the cut-off value (9999)
    3. Add 1 to the modulo result to get the row number starting from 1 and wrapping back to 1 for row number cut-off value + 1

    In my example I have used a smaller cutoff value of 3

    from pyspark.sql import Window
    from pyspark.sql import functions as F
    
    df = spark.createDataFrame([(12345, )for _ in range(0, 10)], ("partcode", ))
    
    window_spec = Window.partitionBy("partcode").orderBy("partcode")
    
    cut_off = 3
    
    df.withColumn("row number", F.row_number().over(window_spec))\
      .withColumn("row number", F.lpad(((F.col("row number") - F.lit(1)) % F.lit(cut_off)) + F.lit(1), 4, '0'))\
      .show()
    

    Output

    +--------+----------+
    |partcode|row number|
    +--------+----------+
    |   12345|      0001|
    |   12345|      0002|
    |   12345|      0003|
    |   12345|      0001|
    |   12345|      0002|
    |   12345|      0003|
    |   12345|      0001|
    |   12345|      0002|
    |   12345|      0003|
    |   12345|      0001|
    +--------+----------+