Tags: dataframe, pyspark, databricks

PySpark code to remove a column within a complex JSON schema


Team,

I am a beginner in PySpark programming.

I am trying to remove the "ProductEntityCode" field from the array column in the schema below. This field is nested within ProductSaleData, which is an array, and that array is itself nested within EventPayload, which is a struct.

Can anyone tell me how to remove this nested ProductEntityCode field while keeping the other columns intact? Any help would be much appreciated.

root

 |-- EventId: string (nullable = true)
 |-- EventPayload: struct (nullable = true)
 |    |-- AccountId: long (nullable = true)
 |    |-- ProductSaleData: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- PortionId: long (nullable = true)
 |    |    |    |-- PortionName: string (nullable = true)
 |    |    |    |-- ProductEntityCode: long (nullable = true)    <-- field to remove
 |    |    |    |-- ProductReference: string (nullable = true)
 |    |    |    
 |    |-- SiteId: long (nullable = true)
 |    |-- SiteReference: string (nullable = true)
 |   
 |-- EventProcessedUtcTime: string (nullable = true)
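
In case it helps, here is a minimal sketch that builds a DataFrame with this exact schema so the problem can be reproduced; the sample values are made up purely for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (ArrayType, LongType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Element struct for the ProductSaleData array
    sale_struct = StructType([
        StructField("PortionId", LongType()),
        StructField("PortionName", StringType()),
        StructField("ProductEntityCode", LongType()),
        StructField("ProductReference", StringType()),
    ])

    schema = StructType([
        StructField("EventId", StringType()),
        StructField("EventPayload", StructType([
            StructField("AccountId", LongType()),
            StructField("ProductSaleData", ArrayType(sale_struct)),
            StructField("SiteId", LongType()),
            StructField("SiteReference", StringType()),
        ])),
        StructField("EventProcessedUtcTime", StringType()),
    ])

    # One made-up row, just to have something to test against
    data = [("evt-001",
             (42, [(1, "Small", 9001, "P-001")], 7, "SITE-A"),
             "2023-01-01T00:00:00Z")]

    df = spark.createDataFrame(data, schema)
    df.printSchema()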
 

Solution

  • You can use withField to replace a child field of a struct column, transform to modify each element of an array column, and dropFields to drop a field from a struct. These Column methods require Spark 3.1 or later. Putting them together:

    from pyspark.sql import functions as F

    # Rebuild EventPayload, replacing ProductSaleData with a copy of the array
    # in which each struct element has ProductEntityCode dropped.
    df = df.withColumn(
        "EventPayload",
        F.col("EventPayload").withField(
            "ProductSaleData",
            F.transform("EventPayload.ProductSaleData",
                        lambda x: x.dropFields("ProductEntityCode")),
        ),
    )
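
    As a quick sanity check (using the DataFrame above, or any DataFrame with this schema), you can print the schema of the transformed column and confirm the field is gone:

    # The element struct should now contain only PortionId, PortionName
    # and ProductReference.
    df.select("EventPayload.ProductSaleData").printSchema()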