I have a column with a struct, something like this:
|properties                |
|--------------------------|
|[john, doe, 123, foo, bar]|
The `properties` column type:
('event_properties', 'struct<name:string,surname:string,id:int,extra_field_1:string,extra_field_2:string>')
and a list of properties:
["name", "surname", "id"]
I am trying to extract each of those properties into its own column, e.g. df.withColumn("name", df.properties.name), and then remove the extracted fields from the `properties` column to avoid duplication and save some space. I'm on PySpark 2, so there is no way to use dropFields().
Desired outcome:
|properties|name|surname|id |
|----------|----|-------|---|
|[foo, bar]|john|doe    |123|
Any help will be appreciated!
import pyspark.sql.functions as F

# Define the fields to extract and the fields to keep in the struct
include = ["name", "surname", "id"]
exclude = ["extra_field_1", "extra_field_2"]

# Function to extract the given fields from the struct column
def get_items(cols):
    return [F.col('properties')[c].alias(c) for c in cols]

# Recreate a struct from the excluded fields and
# assign the extracted fields as top-level columns
df = df.select(F.struct(*get_items(exclude)).alias('properties'), *get_items(include))
df.show()
+----------+----+-------+---+
|properties|name|surname| id|
+----------+----+-------+---+
|{foo, bar}|john| doe|123|
+----------+----+-------+---+
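If you'd rather not hardcode the exclude list, you can derive it from the struct's schema at runtime (in PySpark that would be `df.schema["properties"].dataType.fieldNames()`). A minimal sketch of the set-difference logic, using the field names from your question as a stand-in for the schema lookup:

```python
# Field names as the struct's schema would report them
# (in PySpark: df.schema["properties"].dataType.fieldNames())
all_fields = ["name", "surname", "id", "extra_field_1", "extra_field_2"]

# The properties we want promoted to top-level columns
include = ["name", "surname", "id"]

# Everything else stays inside the rebuilt struct
exclude = [f for f in all_fields if f not in include]
print(exclude)  # ['extra_field_1', 'extra_field_2']
```

This way the same code keeps working if more extra fields are added to the struct later.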