
Flattening nested JSON file into a PySpark DF


I am new to PySpark and I am struggling with flattening a nested JSON file into a PySpark data frame.

I need to define the schema for the JSON data. I know how to define a schema for regular JSON files, but I get stuck in this case. I need a generalised version of this example, not a hard-coded one.

In this example I have a verified manifest: the string AB1 should go under verifiedManifest_name and its quantity under verifiedManifest_quantity. I don't know how many entries verifiedManifest will have, nor how they will be named.

"verifiedManifest":{"AB1":{"quantity":1}, "DE5":{"quantity":5}, "AG1":{"quantity":10}}

Output for this case will look like this:

verifiedManifest_name    verifiedManifest_quantity
AB1                            1
DE5                            5
AG1                            10
DOL1                           100
BG1                            3
etc...

Solution

  • Suppose you have your data in a JSON file:

    # assumes an active SparkSession named `spark`
    from pyspark.sql import functions as f
    import json

    # read the whole file as a single string, then parse it
    input_df = spark.read.text('./test.json', wholetext=True)
    input_string = input_df.collect()[0]['value']
    input_dict = json.loads(input_string)  # safer than eval() for untrusted input
    
    # one row holding the verifiedManifest map
    df = spark.createDataFrame([
        (input_dict['verifiedManifest'],)
    ], ['verifiedManifest'])
    
    df = (
        df
        # explode the map into one (key, value) row per entry
        .select(f.explode(f.col('verifiedManifest')))
        .withColumnRenamed('key', 'verifiedManifest_name')
        # each value is itself a map, so pull out its 'quantity' entry
        .withColumn('verifiedManifest_value', f.element_at(f.col('value'), 'quantity'))
        .select('verifiedManifest_name', 'verifiedManifest_value')
    )
    
    df.show(truncate=False)
    

    and the output will look like:

    +---------------------+----------------------+                                  
    |verifiedManifest_name|verifiedManifest_value|
    +---------------------+----------------------+
    |AB1                  |1                     |
    |AG1                  |10                    |
    |DE5                  |5                     |
    +---------------------+----------------------+
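Since the question asks for a generalised version, the reshaping itself can also be done in plain Python before handing the rows to Spark, which works no matter how many manifest entries there are or what they are called. A minimal sketch (the helper name `flatten_manifest` is mine, not from the original post):

```python
import json

def flatten_manifest(manifest):
    """Turn {name: {'quantity': n}, ...} into (name, quantity) rows,
    regardless of how many entries the manifest holds."""
    return [(name, inner['quantity']) for name, inner in manifest.items()]

input_string = '{"verifiedManifest":{"AB1":{"quantity":1}, "DE5":{"quantity":5}, "AG1":{"quantity":10}}}'
input_dict = json.loads(input_string)

rows = flatten_manifest(input_dict['verifiedManifest'])
# rows is now a list of (name, quantity) tuples

# the rows can then become a DataFrame with explicit column names, e.g.:
# df = spark.createDataFrame(rows, ['verifiedManifest_name', 'verifiedManifest_quantity'])
```

This keeps the Spark side trivial: the schema is just two flat columns, and the dynamic part of the JSON never needs a hard-coded schema.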