Tags: json, dataframe, scala, apache-spark

Convert Dataframe to nested Json in scala


I have a dataframe with the structure below. The *1, *2, ... suffixes in the column names indicate the JSON nesting level, and "->" separates a parent node from its child node.

dataframe.show
id*1 name*1 ppu*1 type*1 toppings1->id2 toppings1->type2 batters1->batter2->id*3 batter2->type3
0001 Cake 0.55 donut 5001 None 1001 Regular
0001 Cake 0.55 donut 5002 Glazed 1002 Chocolate
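
For reference, this is a minimal way to reconstruct the sample dataframe shown above (the SparkSession setup and the literal rows are just for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// sample rows copied from the show() output above
val dataframe = Seq(
  ("0001", "Cake", "0.55", "donut", "5001", "None",   "1001", "Regular"),
  ("0001", "Cake", "0.55", "donut", "5002", "Glazed", "1002", "Chocolate")
).toDF("id*1", "name*1", "ppu*1", "type*1",
  "toppings1->id2", "toppings1->type2",
  "batters1->batter2->id*3", "batter2->type3")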

I need the output as nested JSON, as below:

{
    "items":
        {
            "item":
                [
                    {
                        "id": "0001",
                        "type": "donut",
                        "name": "Cake",
                        "ppu": 0.55,
                        "batters":
                            {
                                "batter":
                                    [
                                        { "id": "1001", "type": "Regular" },
                                        { "id": "1002", "type": "Chocolate" },
                                        { "id": "1003", "type": "Blueberry" },
                                        { "id": "1004", "type": "Devil's Food" }
                                    ]
                            },
                        "topping":
                            [
                                { "id": "5001", "type": "None" },
                                { "id": "5002", "type": "Glazed" },
                                { "id": "5005", "type": "Sugar" },
                                { "id": "5007", "type": "Powdered Sugar" },
                                { "id": "5006", "type": "Chocolate with Sprinkles" },
                                { "id": "5003", "type": "Chocolate" },
                                { "id": "5004", "type": "Maple" }
                            ]
                    },

                    ...

                ]
        }
}

I have tried converting the dataframe with

dataframe.toJSON

which gives me the wrong output. Please help me with how to iterate over the dataframe and create a nested JSON like the one above.
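
For context, toJSON just serializes each row as a flat JSON string, so it cannot produce the nesting implied by the column names:

dataframe.toJSON.show(truncate = false)
// each row comes out flat, e.g. {"id*1":"0001","name*1":"Cake","ppu*1":"0.55","type*1":"donut", ...}
// instead of the nested structure shown above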


Solution

  • Step 1: combine the id and type columns for the toppings and the batters into structs:

    import org.apache.spark.sql._
    import org.apache.spark.sql.functions._
    
    val df = ...
    val df1 = df.withColumn("topping", struct(col("toppings1->id2").as("id"), col("toppings1->type2").as("type")))
                .withColumn("batters", struct(col("batters1->batter2->id*3").as("id"), col("batter2->type3").as("type")))
    

    Result:

    root
     |-- id*1: string (nullable = true)
     |-- name*1: string (nullable = true)
     |-- ppu*1: string (nullable = true)
     |-- type*1: string (nullable = true)
     |-- toppings1->id2: string (nullable = true)
     |-- toppings1->type2: string (nullable = true)
     |-- batters1->batter2->id*3: string (nullable = true)
     |-- batter2->type3: string (nullable = true)
     |-- topping: struct (nullable = false)
     |    |-- id: string (nullable = true)
     |    |-- type: string (nullable = true)
     |-- batters: struct (nullable = false)
     |    |-- id: string (nullable = true)
     |    |-- type: string (nullable = true)
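
    To sanity-check the intermediate result, you can inspect just the new struct columns (a quick check; the values come from the two sample rows above):

    // topping and batters now hold (id, type) pairs per row,
    // e.g. (5001, None) and (1001, Regular) for the first sample row
    df1.select("topping", "batters").show(truncate = false)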
    

    Step 2: group by id*1:

    val df2 = df1.groupBy("id*1")
                             .agg(first("name*1").as("name"),
                                  first("ppu*1").as("ppu"),
                                  first("type*1").as("type"),
                                  collect_list("topping").as("toppings"),
                                  collect_list("batters").as("batters"))
                             .withColumnRenamed("id*1", "id")
    

    Result:

    root
     |-- id: string (nullable = true)
     |-- name: string (nullable = true)
     |-- ppu: string (nullable = true)
     |-- type: string (nullable = true)
     |-- toppings: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- id: string (nullable = true)
     |    |    |-- type: string (nullable = true)
     |-- batters: array (nullable = false)
     |    |-- element: struct (containsNull = false)
     |    |    |-- id: string (nullable = true)
     |    |    |-- type: string (nullable = true)
    

    Step 3: convert to JSON:

    df2.select(to_json(struct("id", "name", "ppu", "type", "toppings", "batters"))).show(truncate=false)
    

    Result:

    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |to_json(struct(id, name, ppu, type, toppings, batters))                                                                                                                                                   |
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |{"id":"0001","name":"Cake","ppu":"0.55","type":"donut","toppings":[{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}],"batters":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}|
    +----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
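
    If you also need the outer "items" / "item" envelope from the desired output, one possible follow-up (a sketch, not part of the steps above) is to collect all item rows into a single array and wrap it in nested structs before serializing:

    // collect every item into one array column named "item",
    // then nest that array under a struct field named "items"
    val wrapped = df2
      .agg(collect_list(struct("id", "name", "ppu", "type", "toppings", "batters")).as("item"))
      .select(to_json(struct(struct(col("item")).as("items"))).as("json"))

    wrapped.show(truncate = false)
    // {"items":{"item":[{"id":"0001","name":"Cake", ... }]}}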