Tags: python, dataframe, pyspark, data-analysis, data-wrangling

How to rank duplicate values of a column in incremental order in PySpark


I have a sample dataframe as below:

customer id|trigger_id
=======================
1          |1101
2          |1102
3          |1101
4          |1102

Now I want to rank every duplicate value of trigger_id in incremental order, like this:

customer id|trigger_id|rank
===========================
1          |1101      |1
2          |1102      |1
3          |1101      |2
4          |1102      |2

After that I want two different dataframes: one with all the even-rank records and the other with all the odd-rank records.

Sorry for the bad formatting.

Thanks in advance.


Solution

  • Use the window rank() function.

    Example:

    df.show()
    #+-----------+----------+
    #|customer_id|trigger_id|
    #+-----------+----------+
    #|          1|      1101|
    #|          2|      1102|
    #|          3|      1101|
    #|          4|      1102|
    #+-----------+----------+
    from pyspark.sql.functions import rank, dense_rank, row_number
    from pyspark.sql import Window
    w = Window.partitionBy("trigger_id").orderBy("customer_id")
    
    #using rank()
    df.withColumn("rank",rank().over(w)).show()
    #+-----------+----------+----+
    #|customer_id|trigger_id|rank|
    #+-----------+----------+----+
    #|          2|      1102|   1|
    #|          4|      1102|   2|
    #|          1|      1101|   1|
    #|          3|      1101|   2|
    #+-----------+----------+----+
    

    If you need the rank to be unique within each partition, use row_number() (dense_rank() gives the same result on this data because there are no ties):

    df.withColumn("rank",row_number().over(w)).orderBy("customer_id").show()
    df.withColumn("rank",dense_rank().over(w)).orderBy("customer_id").show()
    #+-----------+----------+----+
    #|customer_id|trigger_id|rank|
    #+-----------+----------+----+
    #|          1|      1101|   1|
    #|          2|      1102|   1|
    #|          3|      1101|   2|
    #|          4|      1102|   2|
    #+-----------+----------+----+
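
    To get the two dataframes asked about in the question (even ranks vs. odd ranks), you can filter on the parity of the rank column. A minimal sketch, assuming the same window as above; odd_df and even_df are just illustrative names:

    from pyspark.sql.functions import col, row_number
    from pyspark.sql import Window

    w = Window.partitionBy("trigger_id").orderBy("customer_id")
    ranked = df.withColumn("rank", row_number().over(w))

    # records whose rank is odd (1, 3, 5, ...)
    odd_df = ranked.filter(col("rank") % 2 != 0)

    # records whose rank is even (2, 4, 6, ...)
    even_df = ranked.filter(col("rank") % 2 == 0)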