Search code examples
pythondataframepyspark

Convert an array of structs to multiple columns


I've been trying to transform an array of structs into multiple columns, but with no luck. In this case I have a struct with two fields, where I'm trying to make one of the fields the name of the column and the other field, the value.

For example, this would be the schema of the original table:

root
 |-- tags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: string (nullable = true)

and could generate a table similar to this one:

enter image description here

and as the final result would like to be something like this:

root
 |-- 1: string (nullable = true)
 |-- 2: string (nullable = true)
 |-- 3: string (nullable = true)
 |-- 4: string (nullable = true)

enter image description here

I tried a lot of different approaches

one was to explode the array, then expand the struct and use the values to create the new columns, but didn't work as expected.

.withColumn("tag_id", monotonically_increasing_id())
.withColumn("tag", explode("tags"))
.select("tag_id", "tag", "tag.*")

another approach was to transform the structs where I would rename the field using "named_struct", but no luck.

also tried to convert the struct to a map to see other options, but got stuck

.withColumn("tags", transform("tags", lambda t: create_map(t.key, t.value)))

Solution

  • Assign unique id to each row then use inline to explode array of structs into rows and columns, then group the dataframe by id and pivot on key to reshape into wide format

    result = (
        df
        .withColumn('ix', F.monotonically_increasing_id())
        .select('ix', F.inline('tags'))
        .groupBy('ix')
        .pivot('key')
        .agg(F.first('value'))
        .drop('ix')
    )
    

    result.show()
    
    +---+---+---+---+
    |  1|  2|  3|  4|
    +---+---+---+---+
    |  a|  b|  c|  d|
    +---+---+---+---+