I have a nested json, structured as the following example:
{'A':[{'key':'B','value':'C'},{'key':'D','value':'E'}]}
Now I want to map this to the following schema:
|--A
|--|--B
|--|--D
e.g. A structure recovered from a json like:
{'A':{'B':'C','D':'E'}}
The array in 'A' has no fixed number of entries, but the contained dicts always have the two keys 'key','value'
Please find the script below.
from pyspark.sql.functions import lit, col, explode, create_map, collect_list
from itertools import chain
>>> sample.printSchema()
root
|-- A: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
>>> final_df = (sample
... .select(explode('A').alias("A"))
... .withColumn("A",create_map("A.key", "A.value"))
... .groupby().agg(collect_list("A").alias("A"))
... )
>>> final_df.printSchema()
root
|-- A: array (nullable = true)
| |-- element: map (containsNull = false)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
>>> final_df.show(truncate=False)
+--------------------+
|A |
+--------------------+
|[[B -> C], [D -> E]]|
+--------------------+
>>> (final_df
... .write
... .format("json")
... .mode("overwrite")
... .save("sample_files/2020-09-29/out")
... )