I have nested JSON files, and I need to put each of them in one cell of a DataFrame.
The original idea is to take a nested JSON file, add one extra column holding the value of a key called "DataType", put the whole JSON document in a second column, and write it out to an S3 bucket partitioned by that data type. The writing code looks like this:
def write_data(data_df, output_path):
    data_df.coalesce(100).write.partitionBy("DataType").mode("append").parquet(output_path)
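With partitionBy("DataType"), Spark creates one subdirectory per distinct value, so the bucket ends up laid out roughly like this (bucket name, type values, and file names are illustrative):

s3://your-bucket/output/DataType=TypeA/part-00000-...snappy.parquet
s3://your-bucket/output/DataType=TypeB/part-00000-...snappy.parquet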
Basically, it will be a sorting Glue job.
I have tried this:
from pyspark.sql.functions import lit

df = dyf.toDF()
df2 = df.withColumn("data", lit(df.toJSON().first()))
And it looks fine until I take multiple JSON files to process: the output has the same JSON in every row, because first() always returns only the first row.
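To see why, here is a minimal sketch (the sample data is illustrative): df.toJSON() produces one JSON string per row, .first() takes only the first of them, and lit() broadcasts that single string as a constant column.

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("TypeA", 1), ("TypeB", 2)], ["DataType", "value"])

first_json = df.toJSON().first()  # '{"DataType":"TypeA","value":1}'
df2 = df.withColumn("data", lit(first_json))
df2.show(truncate=False)  # both rows carry the TypeA document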
Adding a working solution here:
from pyspark.sql.functions import to_json, struct

df2 = df.withColumn("data", to_json(struct([df[col] for col in df.columns])))
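Here to_json(struct(...)) serializes each row's own columns back into a JSON string, so every record keeps its own document instead of a single broadcast value. For completeness, a minimal sketch of the whole flow, assuming dyf and write_data() from above and that every document carries a top-level "DataType" key (the bucket path is a placeholder):

from pyspark.sql.functions import to_json, struct

df = dyf.toDF()
df2 = df.withColumn("data", to_json(struct([df[col] for col in df.columns])))

# Keep only the partition key and the serialized document, then write
# partitioned by DataType:
write_data(df2.select("DataType", "data"), "s3://your-bucket/output/")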