pyspark, azure-databricks

Make row 'Total' be the last row in pyspark dataframe


I have this dataframe

Column A | Balance | Paid Amount | Id
Total    | 1000    | 10000       | null
1211     | 0       | 5000        | 1
1211     | 1000    | 5000        | 2

This is the code I tried:

from pyspark.sql.functions import monotonically_increasing_id

# Note: monotonically_increasing_id() is unique and increasing, but not consecutive.
df = df.withColumn("Id", monotonically_increasing_id() + 1)

And when I try this to get my desired output:

from pyspark.sql.window import Window as W
from pyspark.sql import functions as F

df1 = df.withColumn("Id", F.monotonically_increasing_id() + 1)
windowSpec = W.orderBy("Id")
df1 = df1.withColumn("Id", F.row_number().over(windowSpec))
df1.show()  # .show() returns None, so don't assign its result back to df1

This is the df I received:

Column A | Balance | Paid Amount | Id
Total    | 1400    | 1100        | null
1211     | 800     | 500         | 1
1211     | 600     | 600         | 2

This is the desired output:

Column A | Balance | Paid Amount | Id
1211     | 800     | 500         | 1
1211     | 600     | 600         | 2
Total    | 1400    | 1100        | 3

EDIT:

I tried the answers below on this post, but I still got the same problem: the values of Balance & Paid Amount are different from the first dataframe.

df1:

Column A | Balance | Paid Amount | Id
Total    | 2000    | 1500        | null
1211     | 800     | 500         | 1
1211     | 600     | 600         | 2
1211     | 400     | 300         | 3
1211     | 200     | 100         | 4

Code I have tried:

from pyspark.sql import functions as F

# Split off the Total row, then append it after the other rows.
total_row = df.filter(F.col("Column A") == "Total")
other_rows = df.filter(F.col("Column A") != "Total")

final_df = other_rows.union(total_row)
final_df.show()
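
A caveat with this first attempt (my observation, not from the original post): union does not guarantee row order in Spark, so even when the Total row is appended last, it is not certain to stay last through further transformations. A minimal sketch of making the ordering explicit with a sort key instead; "is_total" is a hypothetical helper column added just for ordering:

from pyspark.sql import functions as F

# Sort non-Total rows by Id and force the Total row to the end.
final_df = (
    df.withColumn("is_total", (F.col("Column A") == "Total").cast("int"))
      .orderBy("is_total", "Id")
      .drop("is_total")
)
final_df.show()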


from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Flag the Total row so it sorts last, then re-number every row.
df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
windowSpec = W.orderBy("Flag", "Column A")
df1 = df1.withColumn("Id", F.row_number().over(windowSpec))
df1 = df1.drop("Flag")
df1.show()

same output:

Column A | Balance | Paid Amount | Id
1211     | 800     | 500         | 1
1211     | 800     | 500         | 2
1211     | 800     | 500         | 3
1211     | 200     | 100         | 4
Total    | 2000    | 1500        | 5

The Total row is now correctly placed last, but the Balance and Paid Amount values have changed.
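
One plausible cause (an assumption on my part, not confirmed in the post): if Balance and Paid Amount are produced by a non-deterministic upstream computation, Spark can recompute the source plan on each action and yield different values every time. Caching the dataframe before re-numbering pins the values down; a minimal sketch:

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Assumption: df1's values come from a non-deterministic source
# (e.g. monotonically_increasing_id or a shuffled/sampled input).
# cache() + count() materializes the rows once so later actions agree.
df1 = df1.cache()
df1.count()

df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
df1 = df1.withColumn("Id", F.row_number().over(W.orderBy("Flag", "Column A")))
df1 = df1.drop("Flag")
df1.show()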


Solution

  • Here is the sample data frame:

    Column A | Balance | Paid Amount | Id
    Total    | 1400    | 1100        | NULL
    1211     | 800     | 500         | 1
    1211     | 600     | 600         | 2

    You can use the code below to get the required output dataframe:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window as W

    # Flag = 1 only for the Total row, so ordering by Flag pushes it last.
    df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
    windowSpec = W.orderBy("Flag", "Column A")
    # Re-number the rows in the new order, then drop the helper column.
    df1 = df1.withColumn("Id", F.row_number().over(windowSpec))
    df1 = df1.drop("Flag")
    df1.show()
    

    You will get the below dataframe as output:

    Column A | Balance | Paid Amount | Id
    1211     | 800     | 500         | 1
    1211     | 600     | 600         | 2
    Total    | 1400    | 1100        | 3
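
    For reference, here is a self-contained sketch that builds the sample dataframe above (schema assumed from the tables) and applies the same approach:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window as W

    spark = SparkSession.builder.getOrCreate()

    # Sample data matching the tables above; Id is null for the Total row.
    df1 = spark.createDataFrame(
        [("Total", 1400, 1100, None), ("1211", 800, 500, 1), ("1211", 600, 600, 2)],
        ["Column A", "Balance", "Paid Amount", "Id"],
    )

    # Flag = 1 only for the Total row, so ordering by Flag pushes it last.
    df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
    df1 = df1.withColumn("Id", F.row_number().over(W.orderBy("Flag", "Column A")))
    df1 = df1.drop("Flag")
    df1.show()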