python, pyspark, azure-databricks

Cumulative Subtraction in PySpark


I want to achieve cumulative subtraction in PySpark. I have a dataset like this:

councs  coitm
1000    1110
100     1110
50      1110
30      1110
20      1110
2000    1210
10      1210
200     1210
-100    1210
20      1210

My desired result is this:

councs  coitm  _uncs
1000    1110   1000
100     1110   900
50      1110   850
30      1110   820
20      1110   800
2000    1210   2000
10      1210   1990
200     1210   1790
-100    1210   1890
20      1210   1870
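
For reference, the sample can be rebuilt like this (my real _cost_srt also has the ordering column COCHDJ used in the code below, which I left out of the sample):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# (councs, coitm) rows copied from the tables above
data = [
    (1000, 1110), (100, 1110), (50, 1110), (30, 1110), (20, 1110),
    (2000, 1210), (10, 1210), (200, 1210), (-100, 1210), (20, 1210),
]
_cost_srt = spark.createDataFrame(data, ["councs", "coitm"])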

For this I tried the following code:

from pyspark.sql import Window
from pyspark.sql import functions as F

df = _cost_srt.orderBy("coitm")
partition_by = Window.partitionBy("COITM").orderBy(F.desc("COCHDJ"))
df = df.withColumn("rnb", F.row_number().over(partition_by))

# Seed the first row of each partition with its own value
df = df.withColumn(
    "_UNCS", F.when(F.col("rnb") == 1, F.col("COUNCS")).otherwise(F.lit(None))
)

# For later rows, subtract the current value from the previous row's _UNCS
_output = df.withColumn(
    "_UNCS",
    F.when(
        F.col("rnb") > 1, F.lag(F.col("_UNCS")).over(partition_by) - F.col("COUNCS")
    ).otherwise(F.col("_UNCS")),
)

I get the desired output only for rnb 1 and rnb 2; after that _UNCS becomes null. How can I achieve my desired output? Please help me.


Solution

  • Your lag-based approach works only for the first two rows because a window function reads the original _UNCS values, not the ones being computed in the same withColumn, so F.lag("_UNCS") returns null from the third row onward. A cumulative sum avoids this; try the code below:

    • Here I created a column in which every value after the first row of each partition is negated, then summed it over the window to produce the desired running result (see the note on ordering after the code):
    from pyspark.sql import Window, functions as F

    df = cost_srt.orderBy("coitm")
    window_spec = Window.partitionBy("coitm").orderBy(F.desc("councs"))

    # Rank the rows inside each coitm partition
    df = df.withColumn("rnb", F.row_number().over(window_spec))
    df.show()

    # Keep the first value as-is, negate every later value
    df = df.withColumn("negatives", F.when(F.col("rnb") == 1, F.col("councs")).otherwise(-F.col("councs")))
    df.show()

    # A running sum of "negatives" over the window gives the cumulative subtraction
    result_df = df.withColumn("_UNCS", F.sum("negatives").over(window_spec))
    result_df.show()
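
    • Note on ordering: orderBy(F.desc("councs")) happens to reproduce the question's row order for coitm 1110 but not for 1210, so the intermediate _UNCS values can differ from the expected output even though the final total per item matches. If the running subtraction must follow the original row order (or a sequence/date column such as COCHDJ), order the window by that column instead. A sketch, assuming no such column is available and using monotonically_increasing_id as a stand-in ordering key:

    from pyspark.sql import Window, functions as F

    # Tag each row with an increasing id to approximate the incoming row order
    # (an assumption; a real sequence or date column is more reliable)
    df = cost_srt.withColumn("row_id", F.monotonically_increasing_id())
    window_spec = Window.partitionBy("coitm").orderBy("row_id")

    df = df.withColumn("rnb", F.row_number().over(window_spec))
    df = df.withColumn("negatives", F.when(F.col("rnb") == 1, F.col("councs")).otherwise(-F.col("councs")))
    result_df = df.withColumn("_UNCS", F.sum("negatives").over(window_spec))
    result_df.show()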
    
