I have a dataframe with this schema:
"abc": array [
    "def": struct {
        "id": string,
        "value": string
    }
]
id can be "PHONE", "FAX", or "MAIL". So this is a sample:
technical_id | column_to_explode |
---|---|
1 | [["PHONE", "083665xxxx"], ["FAX", "0325xxxxxx"]] |
2 | [["MAIL", "abc@xxx.com"]] |
3 | null |
Is it possible to transform it to:
technical_id | column_to_explode | PHONE | FAX | MAIL |
---|---|---|---|---|
1 | [["PHONE", "083665xxxx"], ["FAX", "0325xxxxxx"]] | 083665xxxx | 0325xxxxxx | null |
2 | [["MAIL", "abc@xxx.com"]] | null | null | abc@xxx.com |
3 | null | null | null | null |
I'm trying with explode, but it duplicates rows and I would rather avoid that.
Thanks.
You can do a pivot after the explode to ensure unique ID columns. Here's an example.
from pyspark.sql import functions as func

# inline() explodes the array of structs into one row per struct (columns _1, _2);
# pivot() then turns the distinct _1 keys into columns.
spark.sparkContext.parallelize([([('phone', 'abc'), ('email', 'xyz')], 1), ([('fax', 'klm')], 2)]). \
    toDF(['arr_of_structs', 'id']). \
    selectExpr('*', 'inline(arr_of_structs)'). \
    groupBy('id'). \
    pivot('_1'). \
    agg(func.first('_2')). \
    show(truncate=False)
# +---+-----+----+-----+
# |id |email|fax |phone|
# +---+-----+----+-----+
# |1 |xyz |null|abc |
# |2 |null |klm |null |
# +---+-----+----+-----+
The input dataframe looks like the following (schema shown via `printSchema()`):
spark.sparkContext.parallelize([([('phone', 'abc'), ('email', 'xyz')], 1), ([('fax', 'klm')], 2)]). \
    toDF(['arr_of_structs', 'id']). \
    show(truncate=False)
# +----------------------------+---+
# |arr_of_structs |id |
# +----------------------------+---+
# |[{phone, abc}, {email, xyz}]|1 |
# |[{fax, klm}] |2 |
# +----------------------------+---+
# root
# |-- arr_of_structs: array (nullable = true)
# | |-- element: struct (containsNull = true)
# | | |-- _1: string (nullable = true)
# | | |-- _2: string (nullable = true)
# |-- id: long (nullable = true)