I've been trying to transform an array of structs into multiple columns, but with no luck. In this case I have a struct with two fields, where I'm trying to make one of the fields the name of the column and the other field, the value.
For example, this would be the schema of the original table:
root
|-- tags: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
and could generate a table similar to this one:
and as the final result would like to be something like this:
root
|-- 1: string (nullable = true)
|-- 2: string (nullable = true)
|-- 3: string (nullable = true)
|-- 4: string (nullable = true)
I tried a lot of different approaches
one was to explode the array, then expand the struct and use the values to create the new columns, but didn't work as expected.
.withColumn("tag_id", monotonically_increasing_id())
.withColumn("tag", explode("tags"))
.select("tag_id", "tag", "tag.*")
another approach was to transform the structs where I would rename the field using "named_struct", but no luck.
also tried to convert the struct to a map to see other options, but got stuck
.withColumn("tags", transform("tags", lambda t: create_map(t.key, t.value)))
Assign unique id to each row then use inline
to explode array of structs into rows and columns, then group the dataframe by id and pivot on key to reshape into wide format
result = (
df
.withColumn('ix', F.monotonically_increasing_id())
.select('ix', F.inline('tags'))
.groupBy('ix')
.pivot('key')
.agg(F.first('value'))
.drop('ix')
)
result.show()
+---+---+---+---+
| 1| 2| 3| 4|
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+