scala, apache-spark

Is it possible to perform a group by taking in all the fields in the aggregate?


I am on Apache Spark 3.3.2. Here is a code sample:

val df: Dataset[Row] = ???

df
 .groupBy($"someKey")
 .agg(collect_set(???)) // I want to collect all the columns here, including the key.

As mentioned in the comment, I want to collect all the columns without having to spell each of them out again. Is there a way to do this?


Solution

  • If your intention is to aggregate all rows that match the same key into a list of JSON-like struct objects, you can do something like:

    import org.apache.spark.sql.functions._
    import spark.implicits._ // for the $"..." column syntax
    
    val df = spark.createDataFrame(Seq(
          ("steak", "1990-01-01", "2022-03-30", 150),
          ("steak", "2000-01-02", "2021-01-13", 180),
          ("fish",  "1990-01-01", "2001-02-01", 100)
        )).toDF("key", "startDate", "endDate", "price")
    
    df.show()
    
    df
     .groupBy("key")
     .agg(collect_set(struct($"*")).as("value"))
     .show(false)
    

    output:

    +-----+----------+----------+-----+
    |  key| startDate|   endDate|price|
    +-----+----------+----------+-----+
    |steak|1990-01-01|2022-03-30|  150|
    |steak|2000-01-02|2021-01-13|  180|
    | fish|1990-01-01|2001-02-01|  100|
    +-----+----------+----------+-----+
    
    +-----+----------------------------------------------------------------------------+
    |key  |value                                                                       |
    +-----+----------------------------------------------------------------------------+
    |steak|[{steak, 1990-01-01, 2022-03-30, 150}, {steak, 2000-01-02, 2021-01-13, 180}]|
    |fish |[{fish, 1990-01-01, 2001-02-01, 100}]                                       |
    +-----+----------------------------------------------------------------------------+
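
  • If what you actually need is a single JSON string per group rather than an array of structs, a minimal sketch (reusing the df above) is to wrap the aggregation in to_json, which accepts arrays of structs since Spark 2.4:

    df
     .groupBy("key")
     .agg(to_json(collect_set(struct($"*"))).as("value")) // serialize the array of structs to a JSON array string
     .show(false)

    Each value is then a JSON array string, something like [{"key":"fish","startDate":"1990-01-01",...}].

  • If, on the other hand, you would rather not repeat the grouping key inside each struct, one possible variation (a sketch, not part of the original answer) is to build the struct from every column except the key:

    // all columns except the grouping key, turned into Column objects
    val nonKeyCols = df.columns.filterNot(_ == "key").map(col)

    df
     .groupBy("key")
     .agg(collect_set(struct(nonKeyCols: _*)).as("value"))
     .show(false)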