I have a df
with the following schema,
g_hut: string
date: date
arr_data:array
element:struct
Id:string
Q_Id:string
Q_Type:string
I want to convert the arr_data
column from Array(Struct)
to Array(Map)
.
g_hut: string
date: date
arr_data:array
element:map
key:string
value:string
Original arr_data
column's Row looks like this,
arr_data:
[
{'Id': '12a', 'Q_Id': 'uac', 'Q_Type': 'action'},
{'Id': '', 'Q_Id': '', ''},
{'Id': '76v', 'Q_Id': '', 'Q_Type': 'form'}
]
I tried the following,
df = df.withColumn("arr_data_map", f.array(f.create_map(
f.lit("Id"), f.col("arr_data.Id"),
f.lit("Q_Id"), f.col("arr_data.Q_Id"),
f.lit("Q_Type"), f.col("arr_data.Q_Type")
)))
I get the following result,
[
{'Id': ['12a', '', '76v']},
{'Q_Id': ['uac', '','']},
{'Q_Type': ['action', '', 'form']}
]
This is not what I want. I want the original arr_data
with the Map
schema as mentioned above. How can I achieve this?
Below to create a sample df
(original) with schema that has array(struct),
data = [
('A', datetime.date(2022, 1, 1), [{'Id': '12a', 'Q_Id': 'uac', 'Q_Type': 'action'},
{'Id': '', 'Q_Id': '', 'Q_Type': ''},
{'Id': '76v', 'Q_Id': '', 'Q_Type': 'form'}]),
('B', datetime.date(2022, 1, 2), [{'Id': '34b', 'Q_Id': 'abc', 'Q_Type': 'action'},
{'Id': '56c', 'Q_Id': 'def', 'Q_Type': 'form'},
{'Id': '78d', 'Q_Id': 'ghi', 'Q_Type': 'action'}])
]
# Define the schema
schema = t.StructType([t.StructField("g_hut", t.StringType()),
t.StructField("date", t.DateType()),
t.StructField("arr_data", t.ArrayType(
t.StructType([
t.StructField("Id", t.StringType()),
t.StructField("Q_Id", t.StringType()),
t.StructField("Q_Type", t.StringType())]))
)
])
# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)
Apply a higher-order transformation function to transform each struct inside the array to the corresponding map representation:
out = df.withColumn(
'arr_data',
F.transform(
'arr_data',
lambda x: F.create_map(
F.lit('Id'), x.Id,
F.lit('Q_Id'), x.Q_Id,
F.lit('Q_Type'), x.Q_Type
)
)
)
out.show()
Output:
+-----+----------+----------------------------------------------------------------------------------------------------------------------------------+
|g_hut|date |arr_data |
+-----+----------+----------------------------------------------------------------------------------------------------------------------------------+
|A |2022-01-01|[{Id -> 12a, Q_Id -> uac, Q_Type -> action}, {Id -> , Q_Id -> , Q_Type -> }, {Id -> 76v, Q_Id -> , Q_Type -> form}] |
|B |2022-01-02|[{Id -> 34b, Q_Id -> abc, Q_Type -> action}, {Id -> 56c, Q_Id -> def, Q_Type -> form}, {Id -> 78d, Q_Id -> ghi, Q_Type -> action}]|
+-----+----------+----------------------------------------------------------------------------------------------------------------------------------+