Search code examples
azure-databricksdatabricks-sql

monotonically_increasing_id function giving same value when used twice


In Databricks notebook I am using monotonically_increasing_id function and creating 2 columns.

Both these columns have the same value i.e. value in each row of COL1 is same as value in COL2 for the same row.

Could you explain why monotonically_increasing_id function is behaving this way


Solution

  • The monotonically_increasing_id function in Spark generates a unique ID for each row, ensuring that each ID value is larger than the previous ones. This function is deterministic and relies on the partition (and its ordering) of your DataFrame to generate IDs.

    When you use monotonically_increasing_id to create two columns in the same operation or within the same transformation stage, Spark calculates those IDs based on the same order of rows within each partition. Because both columns are computed within the same distributed computation context, each row will receive identical ID values in both columns.

    I have tried the below approach:

    from pyspark.sql.functions import monotonically_increasing_id
    
    df = spark.read.csv("/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv", 
                        header=True, 
                        inferSchema=True)
    df = df.withColumn("COL1", monotonically_increasing_id())\
           .withColumn("COL2", monotonically_increasing_id())
    df.show()
    

    Results:

    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+----+----+
    |_c0|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|COL1|COL2|
    +---+-----+---------+-----+-------+-----+-----+-----+----+----+----+----+----+
    |  1| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|   0|   0|
    |  2| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|   1|   1|
    |  3| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|   2|   2|
    |  4| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|   3|   3|
    |  5| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|   4|   4|
    |  6| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|   5|   5|
    |  7| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|   6|   6|
    |  8| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|   7|   7|
    |  9| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|   8|   8|
    | 10| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|   9|   9|
    

    In the above code, I am adding two columns with monotonically increasing IDs.