I have a sample DataFrame as below:
customer id|trigger_id
=======================
1 |1101
2 |1102
3 |1101
4 |1102
Now I want to rank every duplicate value of trigger_id in incremental order, as:
customer id|trigger_id|rank
===========================
1 |1101 |1
2 |1102 |1
3 |1101 |2
4 |1102 |2
After that, I want two different DataFrames: one with all even-rank records, the other with all odd-rank records.
Sorry for the bad formatting, and thanks in advance.
Use the window rank() function.
Example:
df.show()
#+-----------+----------+
#|customer_id|trigger_id|
#+-----------+----------+
#| 1| 1101|
#| 2| 1102|
#| 3| 1101|
#| 4| 1102|
#+-----------+----------+
from pyspark.sql.functions import rank, dense_rank, row_number
from pyspark.sql.window import Window

w=Window.partitionBy("trigger_id").orderBy("customer_id")
#using rank()
df.withColumn("rank",rank().over(w)).show()
#+-----------+----------+----+
#|customer_id|trigger_id|rank|
#+-----------+----------+----+
#| 2| 1102| 1|
#| 4| 1102| 2|
#| 1| 1101| 1|
#| 3| 1101| 2|
#+-----------+----------+----+
For unique sequential values use row_number(), or dense_rank() for consecutive ranks without gaps:
df.withColumn("rank",row_number().over(w)).orderBy("customer_id").show()
df.withColumn("rank",dense_rank().over(w)).orderBy("customer_id").show()
#+-----------+----------+----+
#|customer_id|trigger_id|rank|
#+-----------+----------+----+
#| 1| 1101| 1|
#| 2| 1102| 1|
#| 3| 1101| 2|
#| 4| 1102| 2|
#+-----------+----------+----+