Tags: apache-spark, pyspark

How to assign unique ids to entries in a column using PySpark?


For example, if I have a dataframe with a name column, where each name can occur multiple times:

+------+
| name |
+------+
|Alice |
|Bob   |
|Alice |
|Chloe |
|Chloe |
+------+

I want to have a column where each name gets a unique id starting from 0:

+------+----+
| name | id |
+------+----+
|Alice | 0  |
|Bob   | 1  |
|Alice | 0  |
|Chloe | 2  |
|Chloe | 2  |
+------+----+

How do I achieve this using PySpark? One possible way is to create a data frame with a column of distinct names, assign an index to each name (for example with row_number), and then join back to the original table. But that involves a join, so I wonder if there's a more direct way to achieve this.
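
For reference, here is a minimal sketch of that join-based approach (assuming df is the dataframe above):

    from pyspark.sql import Window
    from pyspark.sql.functions import row_number

    # Build a lookup of distinct names with 0-based ids, then join it back
    lookup = (df.select('name').distinct()
                .withColumn('id', row_number().over(Window.orderBy('name')) - 1))
    result = df.join(lookup, on='name', how='left')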


Solution

  • Use window functions: row_number over a constant window to remember the original row order, and dense_rank over the name column to generate the 0-based ids. Logic and code below:

    from pyspark.sql import Window
    from pyspark.sql.functions import row_number, dense_rank, lit
    new = (df.withColumn('order', row_number().over(Window.partitionBy(lit(1)).orderBy(lit(1))))  # remember original row order
             .withColumn('id', dense_rank().over(Window.partitionBy().orderBy('name')) - 1)  # dense_rank gives equal names the same rank; subtract 1 to start at 0
             .orderBy('order')  # restore original order
             .drop('order'))
    new.show()
    
    +-----+---+
    | name| id|
    +-----+---+
    |Alice|  0|
    |  Bob|  1|
    |Alice|  0|
    |Chloe|  2|
    |Chloe|  2|
    +-----+---+
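
If you want to avoid the helper order column built from row_number, an alternative sketch (not part of the original answer) captures the input order with monotonically_increasing_id instead:

    from pyspark.sql import Window
    from pyspark.sql.functions import monotonically_increasing_id, dense_rank

    # Alternative sketch: tag rows with a monotonically increasing id to capture input order
    new = (df.withColumn('order', monotonically_increasing_id())
             .withColumn('id', dense_rank().over(Window.orderBy('name')) - 1)
             .orderBy('order')
             .drop('order'))
    new.show()

Note that an unpartitioned window such as Window.orderBy('name') moves all rows to a single partition, so both versions are best suited to data that fits comfortably on one executor.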