Tags: python, dataframe, pyspark, data-analysis, data-wrangling

How to rank duplicate values of a column in incremental order in PySpark


I have a sample dataframe as below:

customer id|trigger_id
=======================
1          |1101
2          |1102
3          |1101
4          |1102

Now I want to rank every duplicate value of trigger_id in incremental order, like this:

customer id|trigger_id|rank
===========================
1          |1101      |1
2          |1102      |1
3          |1101      |2
4          |1102      |2

After that I want two different dataframes: one with all the even-rank records and the other with all the odd-rank records.

Sorry for the bad formatting.

Thanks in advance.


Solution

  • Use the window rank() function.

    Example:

    df.show()
    #+-----------+----------+
    #|customer_id|trigger_id|
    #+-----------+----------+
    #|          1|      1101|
    #|          2|      1102|
    #|          3|      1101|
    #|          4|      1102|
    #+-----------+----------+
    from pyspark.sql.functions import rank, dense_rank, row_number
    from pyspark.sql import Window
    w = Window.partitionBy("trigger_id").orderBy("customer_id")
    
    #using rank()
    df.withColumn("rank",rank().over(w)).show()
    #+-----------+----------+----+
    #|customer_id|trigger_id|rank|
    #+-----------+----------+----+
    #|          2|      1102|   1|
    #|          4|      1102|   2|
    #|          1|      1101|   1|
    #|          3|      1101|   2|
    #+-----------+----------+----+
    

    If you need the rank to be unique within each partition, use row_number() (dense_rank() gives the same result on this data because there are no ties):

    df.withColumn("rank",row_number().over(w)).orderBy("customer_id").show()
    df.withColumn("rank",dense_rank().over(w)).orderBy("customer_id").show()
    #+-----------+----------+----+
    #|customer_id|trigger_id|rank|
    #+-----------+----------+----+
    #|          1|      1101|   1|
    #|          2|      1102|   1|
    #|          3|      1101|   2|
    #|          4|      1102|   2|
    #+-----------+----------+----+
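
    To get the two dataframes asked about in the question (even ranks vs. odd ranks), you can filter on the parity of the rank column. A minimal sketch, assuming the same window as above; odd_df and even_df are just illustrative names:

    from pyspark.sql.functions import col, row_number
    from pyspark.sql import Window

    w = Window.partitionBy("trigger_id").orderBy("customer_id")
    ranked = df.withColumn("rank", row_number().over(w))

    # records whose rank is odd (1, 3, 5, ...)
    odd_df = ranked.filter(col("rank") % 2 != 0)

    # records whose rank is even (2, 4, 6, ...)
    even_df = ranked.filter(col("rank") % 2 == 0)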