Tags: scala, apache-spark, apache-spark-sql, rdd, apache-spark-dataset

Summary of a Column (Achieving a Cube Function on Spark Dataset)


For the Dataset below, I need to get summary data based on a selected column. The sample Dataset contains the following data:

+---------+----------+--------+---------+
| Column1 | Column2  | Expend | Expend2 |
+---------+----------+--------+---------+
| School1 | Student1 | 5      | 10      |
| School1 | Student2 | 11     | 12      |
| School2 | Student1 | 6      | 8       |
| School2 | Student2 | 7      | 8       |
+---------+----------+--------+---------+

I need to get summary data for Column2, as below.

Required Format

+---------+----------+--------+---------+
| Column1 | Column2  | Expend | Expend2 |
+---------+----------+--------+---------+
| School1 | Total    | 16     | 22      |
| School1 | Student1 | 5      | 10      |
| School1 | Student2 | 11     | 12      |
| School2 | Total    | 13     | 16      |
| School2 | Student1 | 6      | 8       |
| School2 | Student2 | 7      | 8       |
+---------+----------+--------+---------+

I tried using the cube function on the Dataset, but it didn't give the expected results. I get null values in place of Total, which would be acceptable, but the data does not come back in the format above.

I wanted to try using dataset.cube("Column2").agg(sum("Expend"), sum("Expend2"))

But the above line of code gives me data only for Column2. How can I retrieve the Column1 values along with the aggregated data?
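
For context, grouping over both columns does keep Column1, but the subtotal rows come back with null in Column2 instead of Total; those nulls can be relabelled afterwards. A minimal sketch, assuming the data is held in a DataFrame named df (rollup is used here rather than cube, since only per-school subtotals are needed):

import org.apache.spark.sql.functions._

val summary = df
  .rollup("Column1", "Column2")                                  // groupings: (Column1, Column2), (Column1), ()
  .agg(sum("Expend").as("Expend"), sum("Expend2").as("Expend2"))
  .filter(col("Column1").isNotNull)                              // drop the grand-total row from the () grouping
  .withColumn("Column2", coalesce(col("Column2"), lit("Total"))) // relabel the per-school subtotal rows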


Solution

  • From your existing DataFrame you can create a totals DataFrame where you groupBy Column1 and sum both Expend columns:

    import org.apache.spark.sql.functions._

    // lit("Total") labels the summary rows; the sums collapse each school to one row
    val totaldf = df.groupBy("Column1").agg(lit("Total").as("Column2"), sum("Expend").as("Expend"), sum("Expend2").as("Expend2"))
    
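    Note that lit("Total") is allowed inside agg because it is a constant expression, so Spark does not require it to appear in the groupBy.
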

    Then you just union the two DataFrames (union matches columns by position, so totaldf's columns are listed in the same order as df's):

    df.union(totaldf).orderBy(col("Column1"), col("Column2").desc).show(false)
    

    You should get your desired output:

    +-------+--------+------+-------+
    |Column1|Column2 |Expend|Expend2|
    +-------+--------+------+-------+
    |School1|Total   |16.0  |22.0   |
    |School1|Student2|11    |12     |
    |School1|Student1|5     |10     |
    |School2|Total   |13.0  |16.0   |
    |School2|Student2|7     |8      |
    |School2|Student1|6     |8      |
    +-------+--------+------+-------+
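
    The Total rows print as 16.0 and 13.0 because sum widens the numeric type before the union. If you want the summary rows to display exactly like the detail rows, cast the sums back to the original type. A self-contained sketch of the whole approach (the SparkSession setup and sample data here are assumptions added for runnability, not from the original post):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder()
      .appName("column-summary")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Sample data matching the question's table
    val df = Seq(
      ("School1", "Student1", 5, 10),
      ("School1", "Student2", 11, 12),
      ("School2", "Student1", 6, 8),
      ("School2", "Student2", 7, 8)
    ).toDF("Column1", "Column2", "Expend", "Expend2")

    val totaldf = df.groupBy("Column1").agg(
      lit("Total").as("Column2"),
      sum("Expend").cast("int").as("Expend"),    // cast back so the union keeps df's integer type
      sum("Expend2").cast("int").as("Expend2")
    )

    // "Total" sorts after "Student..." in descending order, putting it first per school
    df.union(totaldf)
      .orderBy(col("Column1"), col("Column2").desc)
      .show(false)

    spark.stop()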