Tags: json, dataframe, scala, apache-spark

Convert Dataframe to nested Json in scala


I have a dataframe with the structure below. The *1, *2, ... suffixes in the column names indicate the JSON nesting level, and "->" separates a parent node from its child node.

dataframe.show
id*1 name*1 ppu*1 type*1 toppings1->id2 toppings1->type2 batters1->batter2->id*3 batter2->type3
0001 Cake 0.55 donut 5001 None 1001 Regular
0001 Cake 0.55 donut 5002 Glazed 1002 Chocolate
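
For reference, this is a minimal way to reconstruct the sample dataframe shown above (the SparkSession setup and the literal rows are just for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// sample rows copied from the show() output above
val dataframe = Seq(
  ("0001", "Cake", "0.55", "donut", "5001", "None",   "1001", "Regular"),
  ("0001", "Cake", "0.55", "donut", "5002", "Glazed", "1002", "Chocolate")
).toDF("id*1", "name*1", "ppu*1", "type*1",
  "toppings1->id2", "toppings1->type2",
  "batters1->batter2->id*3", "batter2->type3")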

I need the output as nested JSON, as below:

{
    "items":
        {
            "item":
                [
                    {
                        "id": "0001",
                        "type": "donut",
                        "name": "Cake",
                        "ppu": 0.55,
                        "batters":
                            {
                                "batter":
                                    [
                                        { "id": "1001", "type": "Regular" },
                                        { "id": "1002", "type": "Chocolate" },
                                        { "id": "1003", "type": "Blueberry" },
                                        { "id": "1004", "type": "Devil's Food" }
                                    ]
                            },
                        "topping":
                            [
                                { "id": "5001", "type": "None" },
                                { "id": "5002", "type": "Glazed" },
                                { "id": "5005", "type": "Sugar" },
                                { "id": "5007", "type": "Powdered Sugar" },
                                { "id": "5006", "type": "Chocolate with Sprinkles" },
                                { "id": "5003", "type": "Chocolate" },
                                { "id": "5004", "type": "Maple" }
                            ]
                    },

                    ...

                ]
        }
}

I have tried converting the dataframe with

dataframe.toJSON

which gives me the wrong output. Please help me with how to iterate over the dataframe and create a nested JSON like the one above.
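
For context, toJSON just serializes each row as a flat JSON string, so it cannot produce the nesting implied by the column names:

dataframe.toJSON.show(truncate = false)
// each row comes out flat, e.g. {"id*1":"0001","name*1":"Cake","ppu*1":"0.55","type*1":"donut", ...}
// instead of the nested structure shown above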


Solution

  • Step 1: combine the id and type columns for the toppings and the batters into structs:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    
    val df = ...
    val df1 = df.withColumn("topping", struct(col("toppings1->id2").as("id"), col("toppings1->type2").as("type")))
                .withColumn("batters", struct(col("batters1->batter2->id*3").as("id"), col("batter2->type3").as("type")))
    

    Result:

    root
     |-- id*1: string (nullable = true)
     |-- name*1: string (nullable = true)
     |-- ppu*1: string (nullable = true)
     |-- type*1: string (nullable = true)
     |-- toppings1->id2: string (nullable = true)
     |-- toppings1->type2: string (nullable = true)
     |-- batters1->batter2->id*3: string (nullable = true)
     |-- batter2->type3: string (nullable = true)
     |-- topping: struct (nullable = false)
     |    |-- id: string (nullable = true)
     |    |-- type: string (nullable = true)
     |-- batters: struct (nullable = false)
     |    |-- id: string (nullable = true)
     |    |-- type: string (nullable = true)
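
    To sanity-check the intermediate result, you can inspect just the new struct columns (a quick check; the values come from the two sample rows above):

    // topping and batters now hold (id, type) pairs per row,
    // e.g. (5001, None) and (1001, Regular) for the first sample row
    df1.select("topping", "batters").show(truncate = false)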
    

    Step 2: group by id*1:

    val df2 = df1.groupBy("id*1")
                             .agg(first("name*1").as("name"),
                                  first("ppu*1").as("ppu"),
                                  first("type*1").as("type"),
                                  collect_list("topping").as("toppings"),
                                  collect_list("batters").as("batters"))
                             .withColumnRenamed("id*1", "id")
    

    Result:

    root
     |-- id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- ppu: string (nullable = true)
     |-- type: string (nullable = true)
     |-- toppings: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- id: string (nullable = true)
     |    |    |-- type: string (nullable = true)
     |-- batters: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- id: string (nullable = true)
     |    |    |-- type: string (nullable = true)
    

    Step 3: convert to JSON:

    df2.select(to_json(struct("id", "name", "ppu", "type", "toppings", "batters"))).show(truncate=false)
    

    Result:

    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |to_json(struct(id, name, ppu, type, toppings, batters))                                                                                                                                                   |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |{"id":"0001","name":"Cake","ppu":"0.55","type":"donut","toppings":[{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}],"batters":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
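
    If you also need the outer "items" / "item" envelope from the desired output, one possible follow-up (a sketch, not part of the steps above) is to collect all item rows into a single array and wrap it in nested structs before serializing:

    // collect every item into one array column named "item",
    // then nest that array under a struct field named "items"
    val wrapped = df2
      .agg(collect_list(struct("id", "name", "ppu", "type", "toppings", "batters")).as("item"))
      .select(to_json(struct(struct(col("item")).as("items"))).as("json"))

    wrapped.show(truncate = false)
    // {"items":{"item":[{"id":"0001","name":"Cake", ... }]}}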