Tags: scala, apache-spark

Merge two columns containing arrays of values into one column in Scala?


I have a dataframe with two array columns. I am trying to merge these two columns into one single column, pairing each value and separating the pair with `:`. E.g. in the example below, subject and mark should be merged to form a column of string type with values like [eng:10,math:20]. Can someone give some pointers here?

    import spark.implicits._

    val columns = Array("id", "subject", "mark")
    val df1 = sc.parallelize(Seq(
      (1, Array("eng", "math"), Array("10", "20"))
    )).toDF(columns: _*)

    df1.printSchema
    df1.show()

Expected dataframe output:

id,newcol
1,[eng:10,math:20]

Solution

  • Check the code below.

    df1.selectExpr(
        "id",
        """
        TRANSFORM(
            ARRAYS_ZIP(subject, mark),
            e -> CONCAT(e.subject, ':', e.mark)
        ) as newcol
        """
    ).show(false)
    
    +---+-----------------+
    |id |newcol           |
    +---+-----------------+
    |1  |[eng:10, math:20]|
    +---+-----------------+
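    Note: TRANSFORM and ARRAYS_ZIP are Spark SQL array/higher-order functions that were added in Spark 2.4, so this selectExpr version needs Spark 2.4 or later.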
    

    OR

    // Needs org.apache.spark.sql.functions; the Scala transform Column function requires Spark 3.0+.
    import org.apache.spark.sql.functions._

    val newColExpr = transform(
        arrays_zip($"subject", $"mark"),
        e => concat(e.getItem("subject"), lit(":"), e.getItem("mark"))
    ).as("newcol")

    df1.select($"id", newColExpr).show(false)