
Explode nested list of objects into DataFrame in Spark


I have a DataFrame that looks like this:

| Column                                           |
|--------------------------------------------------|
| [{a: 2, b: 4}, {a: 2, b: 3}]                     |
| [{a: 12, b: 14}, {a: 25, b: 33}, {a: 22, b: 31}] |
...

I need to convert it into a DataFrame with one row per array element, like this:

| a | b |
|---|---|
| 2 | 4 |
| 2 | 3 |
|12 |14 |

Solution

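  • The simplest approach might be to use the Spark SQL function inline, which expands an array of structs into one row per element and one column per struct field, as shown below: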

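    // Column-based inline is available in the DataFrame API on Spark 3.4+
    import org.apache.spark.sql.functions.inline
    // Needed for toDF and $"..."; already in scope in spark-shell,
    // where the SparkSession is named spark
    import spark.implicits._
    
    case class AB(a: Int, b: Int)
    
    // One row per array; each array element becomes a struct<a: int, b: int>
    val df = Seq(
        Seq(AB(2, 4), AB(2, 3)),
        Seq(AB(12, 14), AB(25, 33), AB(22, 31))
      ).toDF("arrAB")
    
    // inline emits one output row per struct, with the fields as top-level columns
    df.select(inline($"arrAB")).show()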
    /*
    +---+---+
    |  a|  b|
    +---+---+
    |  2|  4|
    |  2|  3|
    | 12| 14|
    | 25| 33|
    | 22| 31|
    +---+---+
    */
    

    Note that while inline has been part of Spark SQL since 2.0, it has been exposed as a built-in function in the DataFrame API (org.apache.spark.sql.functions) only since Spark 3.4. On older Spark versions, call the SQL function through expr instead:

    import org.apache.spark.sql.functions.expr
    df.select(expr("inline(arrAB)"))
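
For readers who want to try this outside spark-shell, here is a minimal, self-contained sketch. It puts the expr("inline(...)") form next to a commonly used alternative that is not part of the answer above: explode followed by a star expansion of the resulting struct. The object name InlineSketch, the helper names viaInlineExpr and viaExplode, the alias "ab", and the local[*] session are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, expr}

// Same element type as in the answer above
case class AB(a: Int, b: Int)

object InlineSketch {
  // Variant 1: call the SQL function inline through expr.
  // arrayCol is assumed to be a plain identifier (no quoting is applied).
  def viaInlineExpr(df: DataFrame, arrayCol: String): DataFrame =
    df.select(expr(s"inline($arrayCol)"))

  // Variant 2 (alternative technique, an assumption of this sketch):
  // explode the array into one struct per row, then expand the struct fields.
  def viaExplode(df: DataFrame, arrayCol: String): DataFrame =
    df.select(explode(col(arrayCol)).as("ab")).select("ab.*")

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session for experimentation
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inline-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Seq(AB(2, 4), AB(2, 3)),
      Seq(AB(12, 14), AB(25, 33), AB(22, 31))
    ).toDF("arrAB")

    viaInlineExpr(df, "arrAB").show()
    viaExplode(df, "arrAB").show()

    spark.stop()
  }
}
```

Both variants should produce the same a and b columns as the output shown above. Both also drop rows whose array is null or empty; if those rows need to be kept, the outer variants (inline_outer in SQL, explode_outer in the DataFrame API) may be worth a look.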