I have a DataFrame with the following nesting in PySpark:
content: struct
    importantId: string
    data: array
        element: struct
            importantCol0: string
            importantCol1: string
I need the following output:
| importantId | importantCol0 | importantCol1 |
|---|---|---|
| 10800005 | 0397AZ | 0397AZ |
| 10800006 | 0397BZ | 0397BZ |
I tried the following code:
df1 = df0.select(F.col('content.*'))
I got:
| importantId | data |
|---|---|
| 10800005 | {importantCol0: 0397AZ, importantCol1: 0397AZ} |
| 10800006 | {importantCol0: 0397BZ, importantCol1: 0397BZ} |
I followed with:
df2 = df1.select(F.col('importantId'), F.col('data.*'))
but I got the following error:
AnalysisException: Can only star expand struct data types. Attribute: ArrayBuffer(data).
Does anyone know how to fix this? I was expecting to be able to star-expand an array the same way as a struct.
Use `inline` to explode the array of structs into rows and columns:
result = df.select('*', F.inline('data')).drop('data')
Example:
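For reference, here is a minimal sketch of how the sample DataFrame below could be built (the column names and values are assumptions chosen to match the output shown):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed sample data: an id column plus an array of structs,
# mirroring the shape of the question's `data` column.
df = spark.createDataFrame(
    [(1, [(1, 2), (4, 3)]),
     (2, [(10, 20), (40, 30)])],
    'importantCol int, data array<struct<importantCol0: int, importantCol1: int>>',
)
```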
df.show()
+------------+--------------------+
|importantCol| data|
+------------+--------------------+
| 1| [{1, 2}, {4, 3}]|
| 2|[{10, 20}, {40, 30}]|
+------------+--------------------+
result.show()
+------------+-------------+-------------+
|importantCol|importantCol0|importantCol1|
+------------+-------------+-------------+
| 1| 1| 2|
| 1| 4| 3|
| 2| 10| 20|
| 2| 40| 30|
+------------+-------------+-------------+
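Note that `inline` is only exposed in `pyspark.sql.functions` in recent Spark versions (3.4+). On older versions it can be reached through `F.expr`, or you can use `explode` followed by the star expansion the question was attempting. A sketch of both, against the same assumed `df` as above:

```python
import pyspark.sql.functions as F

# inline via a SQL expression, for Spark versions where
# F.inline is not yet available in the Python API
result = df.select('importantCol', F.expr('inline(data)'))

# or: explode each struct onto its own row, then star-expand it
result = (
    df.select('importantCol', F.explode('data').alias('d'))
      .select('importantCol', 'd.*')
)
```

Both variants produce the same output as the `inline` solution above. The star expansion works here because after `explode`, `d` is a plain struct column, which is exactly what the original `AnalysisException` was asking for.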