I am on Apache Spark 3.3.2. Here is a sample of the code:
val df: Dataset[Row] = ???
df
.groupBy($"someKey")
  .agg(collect_set(???)) // I want to collect all the columns here, including the key.
As mentioned in the comment, I want to collect all the columns without having to list each one again. Is there a way to do this?
If your intention is to aggregate all rows that share the same key into a list of struct objects, you can do something like:
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $"..." column syntax
val df = spark.createDataFrame(Seq(
("steak", "1990-01-01", "2022-03-30", 150),
("steak", "2000-01-02", "2021-01-13", 180),
("fish", "1990-01-01", "2001-02-01", 100)
)).toDF("key", "startDate", "endDate", "price")
df.show()
df
.groupBy("key")
.agg(collect_set(struct($"*")).as("value"))
.show(false)
Output:
+-----+----------+----------+-----+
| key| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2022-03-30| 150|
|steak|2000-01-02|2021-01-13| 180|
| fish|1990-01-01|2001-02-01| 100|
+-----+----------+----------+-----+
+-----+----------------------------------------------------------------------------+
|key |value |
+-----+----------------------------------------------------------------------------+
|steak|[{steak, 1990-01-01, 2022-03-30, 150}, {steak, 2000-01-02, 2021-01-13, 180}]|
|fish |[{fish, 1990-01-01, 2001-02-01, 100}] |
+-----+----------------------------------------------------------------------------+
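If you actually need JSON strings rather than structs (as the phrase "json objects" suggests), a minimal sketch, reusing the column names from the example above, wraps each row in to_json before collecting:

df
  .groupBy("key")
  .agg(collect_set(to_json(struct($"*"))).as("value")) // each element becomes a JSON string like {"key":"steak",...}
  .show(false)

And if you would rather not repeat the grouping key inside each collected element, one possible approach (a sketch under the assumption that the key column is named "key") is to build the struct from df.columns minus the key, so you still never list the columns by hand:

val nonKeyCols = df.columns.filterNot(_ == "key").map(col) // every column except the grouping key
df
  .groupBy("key")
  .agg(collect_set(struct(nonKeyCols: _*)).as("value"))
  .show(false)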