Search code examples
pythonjsonpyspark

Cannot explode a nested JSON within spark dataframe


I am new to Spark. I am trying to flatten a dataframe but failed to do so with "explode".

The original dataframe schema is like below:

ID|ApprovalJSON
1|[{"ApproverType":"1st Line Manager","Status":"Approved"},{"ApproverType":"2nd Line Manager","Status":"Approved"}]
2|[{"ApproverType":"1st Line Manager","Status":"Approved"},{"ApproverType":"2nd Line Manager","Status":"Rejected"}]

I need to convert it to below schema?

ID|ApprovalType|Status
1|1st Line Manager|Approved
1|2nd Line Manager|Approved
2|1st Line Manager|Approved
2|2nd Line Manager|Rejected

I have tried

df_exploded = df.withColumn("ApprovalJSON", explode("ApprovalJSON"))

But I got error:

Cannot resolve "explode(ApprovalJSON)" due to data type mismatch: parameter 1 requires ("ARRAY" or "MAP") type, however, "ApprovalJSON" is of "STRING" type.;


Solution

  • First parse the JSON like strings to array of structs then use inline to explode array into rows and columns

    df1 = df.withColumn("ApprovalJSON", F.from_json("ApprovalJSON", schema="array<struct<ApproverType string, Status string>>"))
    df1 = df1.select("ID", F.inline('ApprovalJSON'))
    

    Result

    df1.show()
    
    +---+----------------+--------+
    | ID|    ApproverType|  Status|
    +---+----------------+--------+
    |  1|1st Line Manager|Approved|
    |  1|2nd Line Manager|Approved|
    |  2|1st Line Manager|Approved|
    |  2|2nd Line Manager|Rejected|
    +---+----------------+--------+