Tags: python, arrays, apache-spark, pyspark, apache-spark-sql

array(struct) to array(map)—PySpark


I have a df with the following schema,

g_hut: string
date: date
arr_data: array
    element: struct
        Id: string
        Q_Id: string
        Q_Type: string

I want to convert the arr_data column from Array(Struct) to Array(Map).

g_hut: string
date: date
arr_data: array
    element: map
        key: string
        value: string

A row of the original arr_data column looks like this,

arr_data:
[
{'Id': '12a', 'Q_Id': 'uac', 'Q_Type': 'action'},
{'Id': '', 'Q_Id': '', 'Q_Type': ''},
{'Id': '76v', 'Q_Id': '', 'Q_Type': 'form'}
]

I tried the following,

from pyspark.sql import functions as f

df = df.withColumn("arr_data_map", f.array(f.create_map(
        f.lit("Id"), f.col("arr_data.Id"),
        f.lit("Q_Id"), f.col("arr_data.Q_Id"),
        f.lit("Q_Type"), f.col("arr_data.Q_Type")
    )))

I get the following result,

[
    {'Id': ['12a', '', '76v']},
    {'Q_Id': ['uac', '','']},
    {'Q_Type': ['action', '', 'form']}
]

This is not what I want. I want each element of the original arr_data kept as-is, just with the map schema shown above. How can I achieve this?

Below is code to create a sample df (the original) with the array(struct) schema,

import datetime

from pyspark.sql import SparkSession
from pyspark.sql import types as t

spark = SparkSession.builder.getOrCreate()

data = [
    ('A', datetime.date(2022, 1, 1), [{'Id': '12a', 'Q_Id': 'uac', 'Q_Type': 'action'},
                         {'Id': '', 'Q_Id': '', 'Q_Type': ''},
                         {'Id': '76v', 'Q_Id': '', 'Q_Type': 'form'}]),
    ('B', datetime.date(2022, 1, 2), [{'Id': '34b', 'Q_Id': 'abc', 'Q_Type': 'action'},
                         {'Id': '56c', 'Q_Id': 'def', 'Q_Type': 'form'},
                         {'Id': '78d', 'Q_Id': 'ghi', 'Q_Type': 'action'}])
]

# Define the schema
schema = t.StructType([t.StructField("g_hut", t.StringType()),
                       t.StructField("date", t.DateType()),
                       t.StructField("arr_data", t.ArrayType(
                           t.StructType([
                               t.StructField("Id", t.StringType()),
                               t.StructField("Q_Id", t.StringType()),
                               t.StructField("Q_Type", t.StringType())]))
                                     )
                       ])

# Create a DataFrame
df = spark.createDataFrame(data, schema=schema)
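
To verify the starting point, printing the schema should show the array of structs (output sketched from the schema definition above):

df.printSchema()
# root
#  |-- g_hut: string (nullable = true)
#  |-- date: date (nullable = true)
#  |-- arr_data: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- Id: string (nullable = true)
#  |    |    |-- Q_Id: string (nullable = true)
#  |    |    |-- Q_Type: string (nullable = true)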

Solution

  • Your attempt builds the maps per row rather than per array element: referencing a field of an array-of-structs column (e.g. f.col("arr_data.Id")) returns the array of all that field's values, which is why each key ends up paired with a list. Instead, apply the higher-order function transform to convert each struct inside the array into the corresponding map:

    from pyspark.sql import functions as F

    out = df.withColumn(
        'arr_data',
        F.transform(
            'arr_data',
            lambda x: F.create_map(
                F.lit('Id'), x.Id,
                F.lit('Q_Id'), x.Q_Id,
                F.lit('Q_Type'), x.Q_Type
            )
        )
    )

    out.show(truncate=False)

    Output:

    +-----+----------+----------------------------------------------------------------------------------------------------------------------------------+
    |g_hut|date      |arr_data                                                                                                                          |
    +-----+----------+----------------------------------------------------------------------------------------------------------------------------------+
    |A    |2022-01-01|[{Id -> 12a, Q_Id -> uac, Q_Type -> action}, {Id -> , Q_Id -> , Q_Type -> }, {Id -> 76v, Q_Id -> , Q_Type -> form}]               |
    |B    |2022-01-02|[{Id -> 34b, Q_Id -> abc, Q_Type -> action}, {Id -> 56c, Q_Id -> def, Q_Type -> form}, {Id -> 78d, Q_Id -> ghi, Q_Type -> action}]|
    +-----+----------+----------------------------------------------------------------------------------------------------------------------------------+
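
  • The resulting arr_data column now has type array<map<string,string>>, which out.printSchema() should confirm.

  • If you'd rather not hard-code the keys, a variant (a sketch, assuming the struct fields are all strings) can read the field names from the DataFrame schema and build each map dynamically:

    from itertools import chain

    # Field names of the struct elements inside arr_data, i.e. ['Id', 'Q_Id', 'Q_Type']
    field_names = df.schema['arr_data'].dataType.elementType.fieldNames()

    out = df.withColumn(
        'arr_data',
        F.transform(
            'arr_data',
            lambda x: F.create_map(*chain.from_iterable(
                (F.lit(name), x[name]) for name in field_names
            ))
        )
    )

    This should produce the same output as the hard-coded version above.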