apache-spark, pyspark, grouping, unique, window-functions

Create sequential unique id for each group


I'm trying to find a PySpark equivalent of the following pandas snippet (reference), which creates a unique id for every unique combination of two columns.

Pandas approach:

df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
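
For concreteness, on a small sample the pandas line numbers each distinct (foo, bar) pair in order of first appearance (a sketch; the rows mirror the sample data used in the solution below):

import pandas as pd

df = pd.DataFrame({'foo': ['r1', 'r1', 'r1', 's4', 's3', 's3'],
                   'bar': ['ph1', 'ph1', 'ph2', 'ph3', 'ph2', 'ph2']})
df['my_id'] = df.groupby(['foo', 'bar'], sort=False).ngroup() + 1
print(df)
#   foo  bar  my_id
# 0  r1  ph1      1
# 1  r1  ph1      1
# 2  r1  ph2      2
# 3  s4  ph3      3
# 4  s3  ph2      4
# 5  s3  ph2      4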

I tried the following, but it's creating more ids than required:

df = df.withColumn("my_id", F.row_number().over(Window.orderBy('foo', 'bar')))

Solution

  • Instead of row_number, use dense_rank:

    from pyspark.sql import functions as F, Window
    df = spark.createDataFrame(
        [('r1', 'ph1'),
         ('r1', 'ph1'),
         ('r1', 'ph2'),
         ('s4', 'ph3'),
         ('s3', 'ph2'),
         ('s3', 'ph2')],
        ['foo', 'bar'])
    
    df = df.withColumn("my_id", F.dense_rank().over(Window.orderBy('foo', 'bar')))
    df.show()
    # +---+---+-----+
    # |foo|bar|my_id|
    # +---+---+-----+
    # | r1|ph1|    1|
    # | r1|ph1|    1|
    # | r1|ph2|    2|
    # | s3|ph2|    3|
    # | s3|ph2|    3|
    # | s4|ph3|    4|
    # +---+---+-----+
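
    One nuance: pandas' ngroup with sort=False numbers groups by order of first appearance, while dense_rank numbers them in sorted (foo, bar) order, so individual ids can differ (here s4/ph3 gets 4 rather than 3), but both assign exactly one id per unique combination. A quick sanity check (assuming the df from the snippet above):

    # Each distinct (foo, bar) pair should map to exactly one id
    assert df.select('my_id').distinct().count() == df.select('foo', 'bar').distinct().count()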