Table contains following list of columns.
ID | A_ID | DESC | DATE |
---|---|---|---|
100 | 200 | PQ | 2023-05-02 10:30:40 |
100 | 200 | XY | 2023-04-30 21:20:26 |
100 | 200 | AB | 2023-04-11 15:19:10 |
100 | 200 | DE | 2023-05-02 13:52:23 |
100 | 201 | LM | 2023-04-30 21:20:26 |
100 | 201 | FH | 2023-04-11 15:19:10 |
100 | 201 | OP | 2023-05-02 13:52:23 |
101 | 301 | YQ | 2023-05-02 11:52:23 |
Need the output in below format in spark scala dataframe. Sorting should be on Date field.
ID | A_ID | COMMENT |
---|---|---|
100 | 200;201 | INTERNAL : 200 DESCRIPTION: 2023-04-11 15:19:10:AB 2023-04-30 21:20:26:XY 2023-05-02 10:30:40:PQ 2023-05-02 13:52:23:DE INTERNAL : 201 DESCRIPTION: 2023-04-11 15:19:10:FH 2023-04-30 21:20:26:LM 2023-05-02 13:52:23:OP |
101 | 301 | INTERNAL : 301 DESCRIPTION: 2023-05-02 11:52:23:YQ |
Use two groupBy
s: first group by ID
and A_ID
, do the necessary string operations for the comment and then group by ID
again:
import org.apache.spark.sql.functions._
val spark = ...
import spark.implicits._
val df = ...
df.groupBy("ID", "A_ID") //first group by
.agg(collect_list(concat_ws(":",'DATE, 'DESC)).alias("COMMENT"))
.withColumn("COMMENT", concat_ws("\n", array_sort('COMMENT))) //combine the comments for an A_ID
.withColumn("COMMENT", concat(lit("INTERNAL: "), 'A_ID, //add the intro to the comments
lit(" DESCRIPTION: "), 'COMMENT))
.orderBy('ID, 'A_ID)
.groupBy('ID) //second group by
.agg(concat_ws(";",collect_list('A_ID)).alias("A_ID"), //combine all comments for an ID
concat_ws("\n", collect_list('COMMENT)).alias("COMMENT"))
.show(truncate=false)
Output (manually formatted because show
does not handle well multi-line strings):
+---+-------+--------------------------------------------------+
|ID |A_ID |COMMENT |
+---+-------+--------------------------------------------------+
|101|301 |INTERNAL: 301 DESCRIPTION: 2023-05-02 11:52:23:YQ |
|100|200;201|INTERNAL: 200 DESCRIPTION: 2023-04-11 15:19:10:AB
2023-04-30 21:20:26:XY
2023-05-02 10:30:40:PQ
2023-05-02 13:52:23:DE
INTERNAL: 201 DESCRIPTION: 2023-04-11 15:19:10:FH
2023-04-30 21:20:26:LM
2023-05-02 13:52:23:OP|
+---+-------+--------------------------------------------------+