I want to calculate the portion of the value, with only two partitions( where type == red and where type != red)
ID | type | value
-----------------------------
1 | red | 10
2 | blue | 20
3 | yellow | 30
result should be :
ID | type | value | portion
-----------------------------
1 | red | 10 | 1
2 | blue | 20 |0.4
3 | yellow | 30 |0.6
The normal window function in spark only supports partitionby a whole column, but I need the "blue" and "yellow", together recognized as the "non-red" type.
Any idea?
First add a column is_red
to easier differentiate between the two groups. Then you can groupBy
this new column and get the sums for each of the two groups respectively.
To get the fraction (portion), simply divide each row's value by the correct sum, taking into account if the type is red or not. This part can be done using when
and otherwise
in Spark.
Below is the Scala code to do this. There is a sortBy
since when using groupBy
the order of results is not guaranteed. With the sort, sum1
below will contain the total sum for all non-red types while sum2
is the sum for red types.
val sum1 :: sum2 :: _ = df.withColumn("is_red", $"type" === lit("red"))
.groupBy($"is_red")
.agg(sum($"value"))
.collect()
.map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
.toList
.sortBy(_._1)
.map(_._2)
val df2 = df.withColumn("portion", when($"is_red", $"value"/lit(sum2)).otherwise($"value"/lit(sum1)))
The extra is_red
column can be removed using drop
.