Tags: sql, apache-spark, pyspark, apache-spark-sql, window-functions

Assign Rank to Row based on Alphabetical Order Using Window Functions in PySpark


I'm trying to assign a rank to the rows of a dataframe using a window function over a string column (user_id), based on alphabetical order. So, for example:

user_id | rank_num
-------------------
A       | 1
A       | 1
A       | 1
B       | 2
A       | 1
B       | 2
C       | 3
B       | 2
B       | 2
C       | 3

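For reproducibility, the dataframe above can be built like this (a minimal sketch; an active SparkSession named spark is assumed, and rank_num is the desired output rather than an input column):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# user_id values copied from the table above
data = spark.createDataFrame(
    [('A',), ('A',), ('A',), ('B',), ('A',), ('B',), ('C',), ('B',), ('B',), ('C',)],
    ['user_id'],
)
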
I tried using the following lines of code:

from pyspark.sql import Window
from pyspark.sql.functions import dense_rank

user_window = Window.partitionBy('user_id').orderBy('user_id')
data = data.withColumn('rank_num', dense_rank().over(user_window))

But I'm getting something like:

user_id | rank_num
-------------------
A       | 1
A       | 1
A       | 1
B       | 1
A       | 1
B       | 1
C       | 1
B       | 1
B       | 1
C       | 1

Solution

  • Partitioning by user_id is unnecessary here: it places each distinct user_id in its own partition, so every row is ranked 1 within that partition. Ordering the window by user_id alone does what you want:

    from pyspark.sql import Window
    from pyspark.sql.functions import dense_rank

    user_window = Window.orderBy('user_id')
    data = data.withColumn('rank_num', dense_rank().over(user_window))
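
  • One caveat: an ordered window without a partitionBy moves all rows into a single partition (Spark logs "WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation."), which is fine at this size but worth remembering on large dataframes. To sanity-check the result, assuming the sample dataframe above:

    data.select('user_id', 'rank_num').distinct().orderBy('user_id').show()
    # +-------+--------+
    # |user_id|rank_num|
    # +-------+--------+
    # |      A|       1|
    # |      B|       2|
    # |      C|       3|
    # +-------+--------+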