Search code examples
pythonpysparkdatabricksazure-databricks

Pyspark Keep Hierarchy in field names after explode function


I have data that looks like this

{

KTYPE: [{name: "john", surname:"elo"},{name: "dd", surname:"ss"}],
MTYPE: [{name: "dsdsd", id:"elo"},{name: "sdss", id:"sds22"}]
}

I want my data frame columns to be named like the below, without specifying it manually or renaming manually after pyspark explosion function is used.

| KTYPE.name | KTYPE.surname | MTYPE.name | MTYPE.id

Can someone please help me with this


Solution

  • Using the following code, I was able to achieve column names by exploding the data as per requirement.

    • First, I have read the JSON file to create a dictionary of data.
    import json
    
    # Opening JSON file
    with open('/dbfs/FileStore/tables/sample.json') as json_file:
        data = json.load(json_file)
    
    # Print the type of data variable
    print(data)
    

    enter image description here

    • Now, I have created a dataframe using this dictionary (can be done directly but we need the dictionary data).
    df = spark.read.json(sc.parallelize([json.dumps(data)]))
    df.printSchema()
    df.show(truncate=False)
    
    • Building the final required array of column names using the following code:
    col_names = []
    for main_key in data.keys():
        col_names.extend([f"{main_key}.{sub_key}" for sub_key in data[main_key][0].keys()])
    print(col_names)
    

    enter image description here

    • Now, I have created a string expression that will help in selecting each of the property and change the column name using alias:
    required_statement = "df.select("
    
    for column in col_names:
        required_statement = required_statement+f"df['{column}'].alias('{column}'),"
    required_statement = required_statement[:-1] + ")"
    print(required_statement)
    
    #using eval to execute above string expresssion and create a dataframe
    f = eval(required_statement)
    f.show()
    

    enter image description here

    • Now using the selectExpr with arrays_zip function, we can get the required output:
    final = f.selectExpr(f'inline(arrays_zip(`{"`,`".join(col_names)}`))')
    

    enter image description here

    NOTE: select columns by enclosing the name within backticks.