apache-spark, pyspark, grouping, unique, window-functions

Create sequential unique id for each group


I'm trying to find a PySpark equivalent of the following pandas snippet (reference), which creates a unique id for every unique combination of two columns.

Pandas approach:

df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
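
For concreteness, on a small sample the pandas line numbers each distinct (foo, bar) pair in order of first appearance (a sketch; the rows mirror the sample data used in the solution below):

import pandas as pd

df = pd.DataFrame({'foo': ['r1', 'r1', 'r1', 's4', 's3', 's3'],
                   'bar': ['ph1', 'ph1', 'ph2', 'ph3', 'ph2', 'ph2']})
df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
print(df)
#   foo  bar  my_id
# 0  r1  ph1      1
# 1  r1  ph1      1
# 2  r1  ph2      2
# 3  s4  ph3      3
# 4  s3  ph2      4
# 5  s3  ph2      4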

I tried the following, but it's creating more ids than required:

df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))

Solution

  • Instead of row_number, use dense_rank:

    from pyspark.sql import functions as F, Window
    df = spark.createDataFrame(
        [('r1', 'ph1'),
         ('r1', 'ph1'),
         ('r1', 'ph2'),
         ('s4', 'ph3'),
         ('s3', 'ph2'),
         ('s3', 'ph2')],
        ['foo', 'bar'])
    
    df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
    df.show()
    # +---+---+-----+
    # |foo|bar|my_id|
    # +---+---+-----+
    # | r1|ph1|    1|
    # | r1|ph1|    1|
    # | r1|ph2|    2|
    # | s3|ph2|    3|
    # | s3|ph2|    3|
    # | s4|ph3|    4|
    # +---+---+-----+
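
    One nuance: pandas' ngroup with sort=False numbers groups by order of first appearance, while dense_rank numbers them in sorted (foo, bar) order, so individual ids can differ (here s4/ph3 gets 4 rather than 3), but both assign exactly one id per unique combination. A quick sanity check (assuming the df from the snippet above):

    # Each distinct (foo, bar) pair should map to exactly one id
    assert df.select('my_id').distinct().count() == df.select('foo', 'bar').distinct().count()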