
Pyspark Autonumber over a partitioning column


I have a column in my data frame that is sensitive. I need to replace the sensitive values with numbers, but in a way that keeps the distinct count of the column accurate. I was thinking of a SQL function over a window partition, but I couldn't find a way.

A sample dataframe is below.

    df = (sc.parallelize([
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "1234"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "2345"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"},
        {"sensitive_id": "6789"}
    ]).toDF()
    .cache())


I would like to create a dataframe with an additional non_sensitive_id column, where every row with the same sensitive_id is assigned the same number. What is a way to get this done?


Solution

  • You are looking for the dense_rank function:

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    df.withColumn(
        "non_sensitive_id",
        F.dense_rank().over(Window.partitionBy().orderBy("sensitive_id"))
    ).show()
    
    +------------+----------------+
    |sensitive_id|non_sensitive_id|
    +------------+----------------+
    |        1234|               1|
    |        1234|               1|
    |        1234|               1|
    |        2345|               2|
    |        2345|               2|
    |        6789|               3|
    |        6789|               3|
    |        6789|               3|
    |        6789|               3|
    +------------+----------------+
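
Note that Window.partitionBy() with no columns sends every row to a single partition to compute the global ranking, so this is fine for small or medium data but can become a bottleneck on very large frames. If the goal is to hide the sensitive value entirely, the original column can be dropped once the number has been assigned. Below is a minimal end-to-end sketch, assuming a SparkSession named spark; the createDataFrame call is only an illustrative way to rebuild the sample data from the question:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    # Rebuild the sample data from the question
    df = spark.createDataFrame(
        [("1234",)] * 3 + [("2345",)] * 2 + [("6789",)] * 4,
        ["sensitive_id"],
    )

    # dense_rank over a global window gives every occurrence of the same
    # sensitive_id the same number, so the distinct count stays accurate.
    w = Window.partitionBy().orderBy("sensitive_id")
    anonymized = (
        df.withColumn("non_sensitive_id", F.dense_rank().over(w))
          .drop("sensitive_id")  # remove the sensitive column entirely
    )

    anonymized.show()

Any deterministic orderBy column works here; only the fact that equal sensitive_id values receive equal ranks matters for preserving the distinct count.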