Search code examples
dataframescalaapache-spark

Group concat in spark scala dataframe


Table contains following list of columns.

ID A_ID DESC DATE
100 200 PQ 2023-05-02 10:30:40
100 200 XY 2023-04-30 21:20:26
100 200 AB 2023-04-11 15:19:10
100 200 DE 2023-05-02 13:52:23
100 201 LM 2023-04-30 21:20:26
100 201 FH 2023-04-11 15:19:10
100 201 OP 2023-05-02 13:52:23
101 301 YQ 2023-05-02 11:52:23

Need the output in below format in spark scala dataframe. Sorting should be on Date field.

ID A_ID COMMENT
100 200;201 INTERNAL : 200 DESCRIPTION: 2023-04-11 15:19:10:AB
2023-04-30 21:20:26:XY
2023-05-02 10:30:40:PQ
2023-05-02 13:52:23:DE
INTERNAL : 201 DESCRIPTION: 2023-04-11 15:19:10:FH
2023-04-30 21:20:26:LM
2023-05-02 13:52:23:OP
101 301 INTERNAL : 301 DESCRIPTION: 2023-05-02 11:52:23:YQ

Solution

  • Use two groupBys: first group by ID and A_ID, do the necessary string operations for the comment and then group by ID again:

    import org.apache.spark.sql.functions._
    val spark = ...
    import spark.implicits._
    val df = ...
    
    df.groupBy("ID", "A_ID")                                           //first group by
      .agg(collect_list(concat_ws(":",'DATE, 'DESC)).alias("COMMENT"))
      .withColumn("COMMENT", concat_ws("\n", array_sort('COMMENT)))    //combine the comments for an A_ID
      .withColumn("COMMENT", concat(lit("INTERNAL: "), 'A_ID,          //add the intro to the comments
                                   lit(" DESCRIPTION: "), 'COMMENT))
      .orderBy('ID, 'A_ID)
      .groupBy('ID)                                                    //second group by
      .agg(concat_ws(";",collect_list('A_ID)).alias("A_ID"),           //combine all comments for an ID
          concat_ws("\n", collect_list('COMMENT)).alias("COMMENT"))
      .show(truncate=false)
    

    Output (manually formatted because show does not handle well multi-line strings):

    +---+-------+--------------------------------------------------+
    |ID |A_ID   |COMMENT                                           |
    +---+-------+--------------------------------------------------+
    |101|301    |INTERNAL: 301 DESCRIPTION: 2023-05-02 11:52:23:YQ                                                                                                                                                                     |
    |100|200;201|INTERNAL: 200 DESCRIPTION: 2023-04-11 15:19:10:AB
                 2023-04-30 21:20:26:XY
                 2023-05-02 10:30:40:PQ
                 2023-05-02 13:52:23:DE
                 INTERNAL: 201 DESCRIPTION: 2023-04-11 15:19:10:FH
                 2023-04-30 21:20:26:LM
                 2023-05-02 13:52:23:OP|
    +---+-------+--------------------------------------------------+