python json apache-spark pyspark

Create dataframe from Nested JSON


I have the following JSON:

[{"Name":"Tom","Age":"40","Account":"savings","address": {
            "city": "New York",
            "state": "NY"
        }}]

I need to create a Spark dataframe from this JSON with the following columns:

Name Age Account city state

Below is the code I am using:

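from pyspark.sql.types import StructType, StructField, StringType

# A field whose name matches no JSON key is read as null, so use the JSON keys here.
schema2 = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Account", StringType(), True),
    StructField("address", StructType([
        StructField("city", StringType(), True),
        StructField("state", StringType(), True),
    ]), True),
])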

path='dbfs:/FileStore/new.json'
df = spark.read.schema(schema2).option("multiLine", True).json(path)

And I am getting the structure below, where address is still a single nested column:

[screenshot of the resulting dataframe structure]

What schema change is needed to flatten the inner JSON into columns?


Solution

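  • The read schema has to mirror the nesting in the JSON, so the flattening happens after the read: select the nested fields with dot notation and alias them as top-level columns.

    from pyspark.sql.functions import col

    # Pull city and state out of the address struct
    flattened_df = df.select(
        col("Name"),
        col("Age"),
        col("Account"),
        col("address.city").alias("city"),
        col("address.state").alias("state")
    )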
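
    When the JSON has many nested fields, the dotted paths do not have to be typed by hand. The block below is a minimal sketch in plain Python, assuming a sample record is available on the driver; `leaf_paths` is a hypothetical helper written for this illustration, not a PySpark function.

```python
import json


def leaf_paths(record, prefix=""):
    """Return the dotted path to every non-dict leaf of a nested dict."""
    paths = []
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent path along
            paths.extend(leaf_paths(value, path))
        else:
            paths.append(path)
    return paths


# The sample document from the question
sample = json.loads(
    '[{"Name":"Tom","Age":"40","Account":"savings",'
    '"address":{"city":"New York","state":"NY"}}]'
)

paths = leaf_paths(sample[0])
print(paths)
```

    Each path can then be passed to the same select, for example `df.select(*[col(p).alias(p.split(".")[-1]) for p in paths])`. This assumes the leaf names are unique across the whole record; if they are not, the aliases would collide and need a different naming rule.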