
How to add an auto-incrementing column in a dataframe based on another column?


I have a PySpark dataframe similar to below:

order_id    item    qty
123          abc    1
123          abc1   4
234          abc2   5
234          abc3   2
234          abc4   7
123          abc5   5
456          abc6   9
456          abc7   8
456          abc8   9

I want to add an auto-incrementing column based on the column 'order_id' and the expected result is:

order_id    item    qty AutoIncrementingColumn_orderID
123         abc      1   1
123         abc1     4   2
234         abc2     5   1
234         abc3     2   2
234         abc4     7   3
123         abc5     5   3
456         abc6     9   1
456         abc7     8   2
456         abc8     9   3

I couldn't find a solution that generates the increment based on another column. Any idea how to achieve this?


Solution

  • You can use row_number over a window partitioned by order_id:

    from pyspark.sql import functions as F, Window
    
    df2 = df.withColumn(
        'AutoIncrementingColumn_orderID', 
        F.row_number().over(Window.partitionBy('order_id').orderBy('item'))
    )
    
    df2.show()
    +--------+----+---+------------------------------+
    |order_id|item|qty|AutoIncrementingColumn_orderID|
    +--------+----+---+------------------------------+
    |     234|abc2|  5|                             1|
    |     234|abc3|  2|                             2|
    |     234|abc4|  7|                             3|
    |     456|abc6|  9|                             1|
    |     456|abc7|  8|                             2|
    |     456|abc8|  9|                             3|
    |     123| abc|  1|                             1|
    |     123|abc1|  4|                             2|
    |     123|abc5|  5|                             3|
    +--------+----+---+------------------------------+
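    Note that the window operation shuffles the data, so the output rows are grouped by order_id rather than kept in the input order. For intuition, row_number over a window partitioned by a key behaves like a per-key counter. A minimal plain-Python sketch of that logic (the helper row_number_per_group is hypothetical, not a PySpark API):

    ```python
    from collections import defaultdict

    def row_number_per_group(rows, key):
        """Mimic row_number() over a window partitioned by `key`:
        assign a 1-based counter that resets for each distinct key value."""
        counters = defaultdict(int)
        out = []
        for row in rows:
            counters[row[key]] += 1
            out.append({**row, 'AutoIncrementingColumn_orderID': counters[row[key]]})
        return out

    rows = [
        {'order_id': 123, 'item': 'abc',  'qty': 1},
        {'order_id': 123, 'item': 'abc1', 'qty': 4},
        {'order_id': 234, 'item': 'abc2', 'qty': 5},
    ]
    result = row_number_per_group(rows, 'order_id')
    # The second 123-row gets counter 2; the first 234-row starts again at 1.
    ```

    Unlike this sequential sketch, Spark computes the counter per partition in parallel, which is why orderBy('item') is needed inside the window to make the numbering deterministic.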