pyspark, aws-glue

pyspark dataframe array of struct to columns


I have a dataframe with the following structure:

"abc": array [
    "def": struct {
        "id": string,
        "value": string
    }
]

id can be "PHONE", "FAX", or "MAIL". Here is a sample:

technical_id  column_to_explode
1             [["PHONE", "083665xxxx"], ["FAX", "0325xxxxxx"]]
2             [["MAIL", "abc@xxx.com"]]
3             null

Is it possible to transform it to:

technical_id  column_to_explode                                 PHONE       FAX         MAIL
1             [["PHONE", "083665xxxx"], ["FAX", "0325xxxxxx"]]  083665xxxx  0325xxxxxx  null
2             [["MAIL", "abc@xxx.com"]]                         null        null        abc@xxx.com
3             null                                              null        null        null

I'm trying with explode, but it duplicates rows and I would rather avoid that.
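
Roughly, this is what I mean (a simplified sketch rather than my exact code; df is the dataframe above):

    from pyspark.sql import functions as F

    # one output row per element of column_to_explode, so technical_id 1
    # comes back twice: once for PHONE and once for FAX
    df.withColumn('kv', F.explode('column_to_explode')). \
        select('technical_id', 'kv.id', 'kv.value'). \
        show(truncate=False)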

Thanks.


Solution

  • You can do a pivot after the explode so that each unique id value becomes its own column. Here's an example.

    from pyspark.sql import functions as func

    # inline() turns each struct in the array into its own row with columns
    # _1 and _2; the pivot then turns each distinct _1 value into a column
    spark.sparkContext.parallelize([([('phone', 'abc'), ('email', 'xyz')], 1), ([('fax', 'klm')], 2)]). \
        toDF(['arr_of_structs', 'id']). \
        selectExpr('*', 'inline(arr_of_structs)'). \
        groupBy('id'). \
        pivot('_1'). \
        agg(func.first('_2')). \
        show(truncate=False)
    
    # +---+-----+----+-----+
    # |id |email|fax |phone|
    # +---+-----+----+-----+
    # |1  |xyz  |null|abc  |
    # |2  |null |klm |null |
    # +---+-----+----+-----+
    

    The input dataframe looks like the following:

    spark.sparkContext.parallelize([([('phone', 'abc'), ('email', 'xyz')], 1), ([('fax', 'klm')], 2)]). \
        toDF(['arr_of_structs', 'id']). \
        show(truncate=False)
    
    # +----------------------------+---+
    # |arr_of_structs              |id |
    # +----------------------------+---+
    # |[{phone, abc}, {email, xyz}]|1  |
    # |[{fax, klm}]                |2  |
    # +----------------------------+---+
    
    # root
    #  |-- arr_of_structs: array (nullable = true)
    #  |    |-- element: struct (containsNull = true)
    #  |    |    |-- _1: string (nullable = true)
    #  |    |    |-- _2: string (nullable = true)
    #  |-- id: long (nullable = true)
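
    To match the schema from the question (struct fields named id and value) while keeping column_to_explode and the row whose array is null, the same idea can be adapted roughly as follows. This is a sketch that assumes the question's dataframe is called df and that the only keys are PHONE, FAX and MAIL:

    from pyspark.sql import functions as func

    # inline_outer keeps rows whose array is null; listing the pivot values
    # guarantees the PHONE, FAX and MAIL columns even if a key never occurs
    pivoted = df. \
        selectExpr('technical_id', 'inline_outer(column_to_explode)'). \
        groupBy('technical_id'). \
        pivot('id', ['PHONE', 'FAX', 'MAIL']). \
        agg(func.first('value'))

    # join back to the original dataframe so column_to_explode is preserved
    df.join(pivoted, 'technical_id', 'left').show(truncate=False)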