I have a dataframe as below structure each level of json with *1, *2 like that.. and “->” shows the child node of a parent node
dataframe.show
id*1 | name*1 | ppu*1 | type*1 | toppings1->id2 | toppings1->type2 | batters1->batter2->id*3 | batter2->type3 |
---|---|---|---|---|---|---|---|
0001 | Cake | 0.55 | donut | 5001 | None | 1001 | Regular |
0001 | Cake | 0.55 | donut | 5002 | Glazed | 1002 | Chocolate |
I need ouput as nested json as below
{
"items":
{
"item":
[
{
"id": "0001",
"type": "donut",
"name": "Cake",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
{ "id": "1003", "type": "Blueberry" },
{ "id": "1004", "type": "Devil's Food" }
]
},
"topping":
[
{ "id": "5001", "type": "None" },
{ "id": "5002", "type": "Glazed" },
{ "id": "5005", "type": "Sugar" },
{ "id": "5007", "type": "Powdered Sugar" },
{ "id": "5006", "type": "Chocolate with Sprinkles" },
{ "id": "5003", "type": "Chocolate" },
{ "id": "5004", "type": "Maple" }
]
},
...
]
}
}
I have tried converting the dataframe as
dataframe.toJson
which gives me wrong output, please help me in how to iterate the dataframe and create a nested json as above
Step 1: combine the type
and id
columns into a struct:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val df = ...
val df1 = df.withColumn("topping", struct(col("toppings1->id2").as("id"), col("toppings1->type2").as("type")))
.withColumn("batters", struct(col("batters1->batter2->id*3").as("id"), col("batter2->type3").as("type")))
Result:
root
|-- id*1: string (nullable = true)
|-- name*1: string (nullable = true)
|-- ppu*1: string (nullable = true)
|-- type*1: string (nullable = true)
|-- toppings1->id2: string (nullable = true)
|-- toppings1->type2: string (nullable = true)
|-- batters1->batter2->id*3: string (nullable = true)
|-- batter2->type3: string (nullable = true)
|-- topping: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
|-- batters: struct (nullable = false)
| |-- id: string (nullable = true)
| |-- type: string (nullable = true)
Step 2* : group by id*1
:
val df2 = df1.groupBy("id*1")
.agg(first("name*1").as("name"),
first("ppu*1").as("ppu"),
first("type*1").as("type"),
collect_list("topping").as("toppings"),
collect_list("batters").as("batters"))
.withColumnRenamed("id*1", "id")
Result:
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: string (nullable = true)
|-- type: string (nullable = true)
|-- toppings: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- batters: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
Step 2*: convert into Json:
df2.select(to_json(struct("id", "name", "ppu", "type", "toppings", "batters"))).show(truncate=false)
Result:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|to_json(struct(id, name, ppu, type, toppings, batters)) |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|{"id":"0001","name":"Cake","ppu":"0.55","type":"donut","toppings":[{"id":"5001","type":"None"},{"id":"5002","type":"Glazed"}],"batters":[{"id":"1001","type":"Regular"},{"id":"1002","type":"Chocolate"}]}|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+