apache-spark | dataframe | apache-spark-sql

Spark DataFrame sum of column based on condition


I want to calculate each row's portion of its group's total value, with only two groups: where type == red and where type != red.

ID     | type        | value
-------|-------------|------
  1    |    red      |  10
  2    |    blue     |  20
  3    |    yellow   |  30

The result should be:

ID     | type        | value | portion
-------|-------------|-------|--------
  1    |    red      |  10   |  1.0
  2    |    blue     |  20   |  0.4
  3    |    yellow   |  30   |  0.6

A normal window function in Spark only supports partitioning by a whole column, but I need "blue" and "yellow" to be recognized together as the "non-red" type.

Any idea?
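
For reproducibility, the sample data above can be built like this (a sketch; it assumes a SparkSession is in scope as spark):

    import spark.implicits._

    // Reconstruction of the input table shown above.
    val df = Seq(
      (1, "red", 10),
      (2, "blue", 20),
      (3, "yellow", 30)
    ).toDF("ID", "type", "value")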


Solution

  • First, add a column is_red to make it easier to differentiate between the two groups. Then you can groupBy this new column and get the sum for each of the two groups.

    To get the fraction (portion), divide each row's value by the correct sum, taking into account whether the type is red or not. This part can be done using when and otherwise in Spark.

    Below is the Scala code to do this. There is a sortBy because groupBy does not guarantee the order of its results; since false sorts before true, sum1 below will contain the total sum for all non-red types while sum2 is the sum for the red type.

    import org.apache.spark.sql.functions._
    import spark.implicits._

    // Keep the DataFrame with the helper column so it can be reused below.
    val dfWithRed = df.withColumn("is_red", $"type" === lit("red"))

    // false sorts before true, so sum1 is the non-red total and sum2 the red total.
    val sum1 :: sum2 :: _ = dfWithRed
      .groupBy($"is_red")
      .agg(sum($"value"))
      .collect()
      .map(row => (row.getAs[Boolean](0), row.getAs[Long](1)))
      .toList
      .sortBy(_._1)
      .map(_._2)

    val df2 = dfWithRed.withColumn("portion",
      when($"is_red", $"value" / lit(sum2)).otherwise($"value" / lit(sum1)))
    

    The extra is_red column can be removed using drop.
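
    Putting it together, a minimal end-to-end sketch (using df and df2 as defined above):

    // Drop the helper column and inspect the result.
    val result = df2.drop("is_red")
    result.show()
    // Expected output (row order may vary):
    // +---+------+-----+-------+
    // | ID|  type|value|portion|
    // +---+------+-----+-------+
    // |  1|   red|   10|    1.0|
    // |  2|  blue|   20|    0.4|
    // |  3|yellow|   30|    0.6|
    // +---+------+-----+-------+

    Note that collect() brings only the two aggregated sums back to the driver, which is cheap here and avoids running a window function over the whole dataset.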