Tags: scala, apache-spark, apache-spark-sql

How to translate a complex nested JSON structure into multiple columns in a Spark DataFrame


I am learning Scala and am trying to pull a select few columns out of a large nested JSON file into a DataFrame. This is the gist of the JSON:

{
  "meta":
    {"a": 1, "b": 2},    // I want to ignore meta
  "objects":
  [
    {
      "caucus": "Progressive",
      "person":
        {
          "name": "Mary",
          "party": "Green Party",
          "age": 50,
          "gender": "female"  // etc.
        }
    }  // etc.
  ]
}

Read in with Spark as-is, the data looks like this:

    val df = spark.read.json("file")
    df.show()
+--------------------+--------------------+
|                meta|             objects|
+--------------------+--------------------+
|[limit -> 100.0, ...|[[, [116.0, 117.0...|
+--------------------+--------------------+

Instead of this, I want a DataFrame with the columns: Name | Party | Caucus.

I've messed around with explode() and have reproduced the schema as a StructType(), but am not sure how to deal with a nested structure like this in general.
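
For reference, printing the inferred schema makes the nesting explicit. A sketch, assuming the gist above is saved as file.json (the exact types Spark infers may vary):

    val df = spark.read.json("file.json")
    df.printSchema()
    // root
    //  |-- meta: struct (nullable = true)
    //  |    |-- a: long (nullable = true)
    //  |    |-- b: long (nullable = true)
    //  |-- objects: array (nullable = true)
    //  |    |-- element: struct (containsNull = true)
    //  |    |    |-- caucus: string (nullable = true)
    //  |    |    |-- person: struct (nullable = true)
    //  |    |    |    |-- age: long (nullable = true)
    //  |    |    |    |-- gender: string (nullable = true)
    //  |    |    |    |-- name: string (nullable = true)
    //  |    |    |    |-- party: string (nullable = true)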


Solution

  • You can use ".*" on a column of struct type to transform it into multiple columns, one per field:

    import org.apache.spark.sql.functions.{col, explode}

    val df = spark.read.json("file.json")
    df.select(col("meta"), explode(col("objects")).as("objects"))  // one row per array element
      .select("meta.*", "objects.*")                               // flatten the two structs
      .select("a", "b", "caucus", "person.*")                      // flatten person as well
      .show(false)
    
    
    +---+---+-----------+---+------+----+-----------+
    |a  |b  |caucus     |age|gender|name|party      |
    +---+---+-----------+---+------+----+-----------+
    |1  |2  |Progressive|50 |female|Mary|Green Party|
    +---+---+-----------+---+------+----+-----------+
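
To get exactly the three columns the question asks for (Name | Party | Caucus), a variant of the same approach skips meta and projects the nested person fields directly. A sketch under the same assumed schema:

    df.select(explode(col("objects")).as("obj"))
      .select(
        col("obj.person.name").as("Name"),
        col("obj.person.party").as("Party"),
        col("obj.caucus").as("Caucus")
      )
      .show(false)

    // +----+-----------+-----------+
    // |Name|Party      |Caucus     |
    // +----+-----------+-----------+
    // |Mary|Green Party|Progressive|
    // +----+-----------+-----------+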