
PySpark: explode values in list from struct and delete them from this struct


I have a column with a struct, something like this:

|properties                |
|[john, doe, 123, foo, bar]|

properties column type: ('event_properties', 'struct<name:string,surname:string,id:int,extra_field_1:string,extra_field_2:string>')

and a list of properties:

["name", "surname", "id"]

I am trying to achieve the following:

  1. Extract fields from properties, but only those in the list. In pure Python this could be done with a simple loop, but I am looking for a more Spark-like way. I know that I can call df.withColumn("name", df.properties.name).
  2. Remove the extracted fields from the properties column to avoid duplication and save some space. I am on PySpark 2, so dropFields() (added in Spark 3.1) is not available.
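
For reference, step 1 alone can be done with the loop alluded to above, one withColumn per listed field. This is only a sketch of the loop approach (the sample data is reconstructed from the question); it does not handle step 2, removing the fields from the struct:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").getOrCreate()

    # Recreate the sample row from above as a single struct column
    df = spark.createDataFrame(
        [(("john", "doe", 123, "foo", "bar"),)],
        "properties struct<name:string,surname:string,id:int,extra_field_1:string,extra_field_2:string>",
    )

    # One withColumn per field: extracts the values but leaves them in the struct too
    for c in ["name", "surname", "id"]:
        df = df.withColumn(c, df["properties"][c])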

Desired outcome:

|properties|name|surname|id |
|[foo, bar]|john|doe    |123|

Any help will be appreciated!


Solution

  • Code

    from pyspark.sql import functions as F
    
    # Define the lists of fields to extract and to keep in the struct
    include = ["name", "surname", "id"]
    exclude = ['extra_field_1', 'extra_field_2']
    
    # Function to extract the relevant items from the struct field
    def get_items(cols):
        return [F.col('properties')[c].alias(c) for c in cols]
    
    # Recreate a struct from the excluded fields and
    # assign the remaining columns to the dataframe
    df = df.select(F.struct(*get_items(exclude)).alias('properties'), *get_items(include))
    

    Result

    df.show()
    
    +----------+----+-------+---+
    |properties|name|surname| id|
    +----------+----+-------+---+
    |{foo, bar}|john|    doe|123|
    +----------+----+-------+---+