I have this dataframe:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| Total | 1000 | 10000 | null |
| 1211 | 0 | 5000 | 1 |
| 1211 | 1000 | 5000 | 2 |
This is the code I tried for that:
from pyspark.sql.functions import monotonically_increasing_id

df = df.withColumn("Id", monotonically_increasing_id() + 1)
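Note that monotonically_increasing_id() only guarantees increasing, unique values, not consecutive ones: the generated id embeds the partition number in its upper bits, so values jump by large offsets between partitions. A minimal illustration (the repartition(2) is only there to force more than one partition):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Force two partitions so the ids come from different partition offsets.
df_demo = spark.range(4).repartition(2)

# Increasing and unique, but not 1, 2, 3, 4: each partition starts
# at a different offset, e.g. 1 and 8589934593.
df_demo.withColumn("Id", F.monotonically_increasing_id() + 1).show()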
And when I try this to get my desired output:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

df1 = df.withColumn("Id", F.monotonically_increasing_id() + 1)
windowSpec = W.orderBy("Id")

# Replace the sparse ids with consecutive row numbers.
# Note: .show() returns None, so don't assign its result back to df1.
df1 = df1.withColumn("Id", F.row_number().over(windowSpec))
df1.show()
This is the df I received:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| Total | 1400 | 1100 | null |
| 1211 | 800 | 500 | 1 |
| 1211 | 600 | 600 | 2 |
This is the desired output:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| 1211 | 800 | 500 | 1 |
| 1211 | 600 | 600 | 2 |
| Total | 1400 | 1100 | 3 |
EDIT:
I tried the answers below from this post but I still got the same error. The values of Balance & Paid Amount are different from the 1st dataframe.
df1:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| Total | 2000 | 1500 | null |
| 1211 | 800 | 500 | 1 |
| 1211 | 600 | 600 | 2 |
| 1211 | 400 | 300 | 3 |
| 1211 | 200 | 100 | 4 |
Code I have tried:
from pyspark.sql import functions as F

# Split out the Total row and append it after all the other rows.
total_row = df.filter(F.col("Column A") == "Total")
other_rows = df.filter(F.col("Column A") != "Total")
final_df = other_rows.union(total_row)
final_df.show()
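Note that this union only reorders the rows; it does not renumber the Id column, so the Total row keeps its null Id. A minimal sketch of renumbering after the union, assuming the original Id column is still present (asc_nulls_last pushes the Total row's null Id to the end):

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Renumber after the reordering so the Total row gets the last Id.
final_df = final_df.withColumn(
    "Id",
    F.row_number().over(W.orderBy(F.col("Id").asc_nulls_last()))
)
final_df.show()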
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Flag the Total row so it sorts last, renumber, then drop the flag.
df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
windowSpec = W.orderBy("Flag", "Column A")
df1 = df1.withColumn("Id", F.row_number().over(windowSpec))
df1 = df1.drop("Flag")
df1.show()
The same incorrect output:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| 1211 | 800 | 500 | 1 |
| 1211 | 800 | 500 | 2 |
| 1211 | 800 | 500 | 3 |
| 1211 | 200 | 100 | 4 |
| Total | 2000 | 1500 | 5 |
The Total row now has the correct result for both columns; however, the values in the other rows have changed.
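One likely cause (an assumption, since the full pipeline isn't shown): the window is ordered only by Flag and Column A, and every non-Total row shares Column A = 1211, so their relative order is a tie that Spark may resolve differently on each run; monotonically_increasing_id() can also be recomputed differently between actions. A sketch that breaks the tie deterministically by ordering on the existing Id as well:

from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Order by Flag first (Total last), then by the pre-existing Id
# instead of the constant Column A, so ties are broken deterministically.
df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
windowSpec = W.orderBy("Flag", F.col("Id").asc_nulls_last())
df1 = df1.withColumn("Id", F.row_number().over(windowSpec)).drop("Flag")
df1.show()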
Here is the sample data frame:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| Total | 1400 | 1100 | NULL |
| 1211 | 800 | 500 | 1 |
| 1211 | 600 | 600 | 2 |
You can use the code below to get your required output dataframe:
from pyspark.sql import functions as F
from pyspark.sql.window import Window as W

# Add a helper Flag column: 1 for the Total row, 0 otherwise,
# so that ordering by Flag sorts the Total row after all others.
df1 = df1.withColumn("Flag", F.when(F.col("Column A") == "Total", 1).otherwise(0))
windowSpec = W.orderBy("Flag", "Column A")

# Renumber every row, then drop the helper column.
df1 = df1.withColumn("Id", F.row_number().over(windowSpec))
df1 = df1.drop("Flag")
df1.show()
You will get the below dataframe as output:
| Column A | Balance | Paid Amount | Id |
| --- | --- | --- | --- |
| 1211 | 800 | 500 | 1 |
| 1211 | 600 | 600 | 2 |
| Total | 1400 | 1100 | 3 |
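As a quick sanity check, you can verify that the Total row still matches the sums of the other rows after the reordering:

from pyspark.sql import functions as F

# Recompute the sums over the non-Total rows and compare
# them with the Total row: expected 1400 and 1100 here.
df1.filter(F.col("Column A") != "Total").agg(
    F.sum("Balance").alias("Balance"),
    F.sum("Paid Amount").alias("Paid Amount"),
).show()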