I have data that looks like this:
{
  "KTYPE": [{"name": "john", "surname": "elo"}, {"name": "dd", "surname": "ss"}],
  "MTYPE": [{"name": "dsdsd", "id": "elo"}, {"name": "sdss", "id": "sds22"}]
}
I want my DataFrame columns to be named as below, without specifying them manually or renaming them manually after PySpark's explode function is used:
| KTYPE.name | KTYPE.surname | MTYPE.name | MTYPE.id |
Can someone please help me with this?
Using the following code, I was able to get the required column names by exploding the data:
import json
# Opening JSON file
with open('/dbfs/FileStore/tables/sample.json') as json_file:
    data = json.load(json_file)
# Print the loaded data
print(data)
df = spark.read.json(sc.parallelize([json.dumps(data)]))
df.printSchema()
df.show(truncate=False)
col_names = []
for main_key in data.keys():
    col_names.extend([f"{main_key}.{sub_key}" for sub_key in data[main_key][0].keys()])
print(col_names)
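As a standalone check, the same name-flattening logic can be run on the sample data without Spark (a sketch using the sample JSON from the question; dict insertion order is preserved in Python 3.7+):

```python
# Sample data as a plain Python dict (mirrors sample.json above)
data = {
    "KTYPE": [{"name": "john", "surname": "elo"}, {"name": "dd", "surname": "ss"}],
    "MTYPE": [{"name": "dsdsd", "id": "elo"}, {"name": "sdss", "id": "sds22"}],
}
# Flatten top-level keys and the sub-keys of the first record in each list
col_names = [f"{k}.{sub}" for k, rows in data.items() for sub in rows[0].keys()]
print(col_names)
# ['KTYPE.name', 'KTYPE.surname', 'MTYPE.name', 'MTYPE.id']
```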
required_statement = "df.select("
for column in col_names:
    required_statement = required_statement+f"df['{column}'].alias('{column}'),"
required_statement = required_statement[:-1] + ")"
print(required_statement)
# Using eval to execute the above string expression and create a DataFrame
f = eval(required_statement)
f.show()
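As a side note, the eval step is avoidable: DataFrame.select accepts a list of Column objects, so the string-building loop can be replaced with a comprehension. A minimal sketch, assuming `df` and `col_names` from the code above (the helper name `build_select_columns` is my own, not from the original answer):

```python
def build_select_columns(df, col_names):
    # df[c] resolves the nested field; alias(c) keeps the dotted name as-is
    return [df[c].alias(c) for c in col_names]

# f = df.select(build_select_columns(df, col_names))
```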
Using selectExpr with the arrays_zip function, we can get the required output:
final = f.selectExpr(f'inline(arrays_zip(`{"`,`".join(col_names)}`))')
NOTE: since the names contain dots, select such columns by enclosing each name within backticks.
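For reference, the selectExpr string built above expands to the following, which can be verified in pure Python without Spark (using the col_names list produced earlier):

```python
col_names = ["KTYPE.name", "KTYPE.surname", "MTYPE.name", "MTYPE.id"]
# Join the dotted names with backtick separators and wrap in inline(arrays_zip(...))
expr = f'inline(arrays_zip(`{"`,`".join(col_names)}`))'
print(expr)
# inline(arrays_zip(`KTYPE.name`,`KTYPE.surname`,`MTYPE.name`,`MTYPE.id`))
```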