apache-spark user-defined-functions

How to add a new field to a nested array-of-struct column in Spark <= 2.3


I have a DataFrame with the schema below:

root
     |-- date: timestamp (nullable = true)
     |-- questionAnswerList: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- questionNumber: string (nullable = true)
     |    |    |-- listAnswers: array (nullable = true) 
     |    |    |    |-- element: string (containsNull = true)


I want to add a new field inside each struct of the array, as in the schema below:

root
     |-- date: timestamp (nullable = true)
     |-- questionAnswerList: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- index: integer (nullable = true)
     |    |    |-- questionNumber: string (nullable = true)
     |    |    |-- listAnswers: array (nullable = true) 
     |    |    |    |-- element: string (containsNull = true)

I tried to use a UDF like this:

val addIndexInStruct: UserDefinedFunction = udf((data: Seq[Row]) => {
  data.zipWithIndex.map { case (Row(x: String, y: Array[String]), index) => (index, x, y) }
})

df.withColumn("newCol",addIndexInStruct($"questionAnswerList")).show(false)

But I get the following error:

Caused by: scala.MatchError: ([Q10,WrappedArray(R10.1, R10.2)],0) (of class scala.Tuple2)

Does anybody have an idea how to do this in Spark 2.x? I saw in other posts that in Spark 3.x the transform function can be used.


Solution

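  • I finally solved it. Seq had to be used instead of Array in the pattern match: Spark passes an array column to a Scala UDF as a Seq (the WrappedArray in the error message), not as an Array, so the Array[String] pattern never matched.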

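    import org.apache.spark.sql.Row
    import org.apache.spark.sql.expressions.UserDefinedFunction
    import org.apache.spark.sql.functions.udf
    // Array columns reach a Scala UDF as Seq (WrappedArray), so match on Seq[String].
    val addIndexInStruct: UserDefinedFunction = udf((data: Seq[Row]) =>
      data.zipWithIndex.map { case (Row(x: String, y: Seq[String]), index) => (index, x, y) })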
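
One thing to watch with the tuple version: a UDF that returns tuples produces struct fields named _1, _2 and _3, not index, questionNumber and listAnswers as in the target schema. A possible variation, sketched below, returns a case class so that the fields pick up its member names. The names IndexedQuestionAnswer, AddIndexExample, addIndex and withIndexedAnswers are made up for this illustration and are not from the original post. The sketch also assumes questionNumber and listAnswers sit at positions 0 and 1 of each struct, as in the schema from the question.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical case class (not in the original post): its field names
// become the field names of the struct that the UDF returns.
case class IndexedQuestionAnswer(index: Int, questionNumber: String, listAnswers: Seq[String])

object AddIndexExample {
  // Assumes each struct holds questionNumber at position 0 and listAnswers
  // at position 1, as in the schema from the question.
  def addIndex(data: Seq[Row]): Seq[IndexedQuestionAnswer] =
    if (data == null) null // a null array column is passed to the UDF as null
    else data.zipWithIndex.map { case (row, index) =>
      IndexedQuestionAnswer(index, row.getString(0), row.getSeq[String](1))
    }

  // Wrap the plain function so it can be applied to a column.
  val addIndexInStruct: UserDefinedFunction = udf(addIndex _)

  // Overwrites questionAnswerList in place instead of adding a separate column.
  def withIndexedAnswers(df: DataFrame): DataFrame =
    df.withColumn("questionAnswerList", addIndexInStruct(col("questionAnswerList")))
}
```

Calling AddIndexExample.withIndexedAnswers(df).printSchema() should then show index, questionNumber and listAnswers inside the array element. Because index is declared as a Scala Int, Spark will likely report it as nullable = false instead of nullable = true.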