python pyspark databricks azure-databricks

Pyspark Keep Hierarchy in field names after explode function

I have data that looks like this

{

KTYPE: [{name: "john", surname:"elo"},{name: "dd", surname:"ss"}],
MTYPE: [{name: "dsdsd", id:"elo"},{name: "sdss", id:"sds22"}]
}

I want my data frame columns to be named like the below, without specifying it manually or renaming manually after pyspark explosion function is used.

| KTYPE.name | KTYPE.surname | MTYPE.name | MTYPE.id

Can someone please help me with this

Solution

Using the following code, I was able to achieve column names by exploding the data as per requirement.

First, I have read the JSON file to create a dictionary of data.

import json

# Opening JSON file
with open('/dbfs/FileStore/tables/sample.json') as json_file:
    data = json.load(json_file)

# Print the type of data variable
print(data)

enter image description here

Now, I have created a dataframe using this dictionary (can be done directly but we need the dictionary data).

df = spark.read.json(sc.parallelize([json.dumps(data)]))
df.printSchema()
df.show(truncate=False)

Building the final required array of column names using the following code:

col_names = []
for main_key in data.keys():
    col_names.extend([f"{main_key}.{sub_key}" for sub_key in data[main_key][0].keys()])
print(col_names)

enter image description here

Now, I have created a string expression that will help in selecting each of the property and change the column name using alias:

required_statement = "df.select("

for column in col_names:
    required_statement = required_statement+f"df['{column}'].alias('{column}'),"
required_statement = required_statement[:-1] + ")"
print(required_statement)

#using eval to execute above string expresssion and create a dataframe
f = eval(required_statement)
f.show()

enter image description here

Now using the selectExpr with arrays_zip function, we can get the required output:

final = f.selectExpr(f'inline(arrays_zip(`{"`,`".join(col_names)}`))')

enter image description here

NOTE: select columns by enclosing the name within backticks.