Tags: dataframe, pyspark, databricks

PySpark code to remove a column within a complex JSON schema


Team,

I am a beginner in PySpark programming.

I am trying to remove the "ProductEntityCode" field from the array column in the schema below. This field is nested within ProductSaleData, which is an array, and that array is itself nested within EventPayload, which is a struct.

Can anyone tell me how to remove this nested ProductEntityCode field while keeping the other columns intact? Any help would be much appreciated.

root

 |-- EventId: string (nullable = true)
 |-- EventPayload: struct (nullable = true)
 |    |-- AccountId: long (nullable = true)
 |    |-- ProductSaleData: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- PortionId: long (nullable = true)
 |    |    |    |-- PortionName: string (nullable = true)
 |    |    |    |-- ProductEntityCode: long (nullable = true)    <-- field to remove
 |    |    |    |-- ProductReference: string (nullable = true)
 |    |    |    
 |    |-- SiteId: long (nullable = true)
 |    |-- SiteReference: string (nullable = true)
 |   
 |-- EventProcessedUtcTime: string (nullable = true)
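
In case it helps, here is a minimal sketch that builds a DataFrame with this exact schema so the problem can be reproduced; the sample values are made up purely for illustration.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import (ArrayType, LongType, StringType,
                                   StructField, StructType)

    spark = SparkSession.builder.getOrCreate()

    # Element struct for the ProductSaleData array
    sale_struct = StructType([
        StructField("PortionId", LongType()),
        StructField("PortionName", StringType()),
        StructField("ProductEntityCode", LongType()),
        StructField("ProductReference", StringType()),
    ])

    schema = StructType([
        StructField("EventId", StringType()),
        StructField("EventPayload", StructType([
            StructField("AccountId", LongType()),
            StructField("ProductSaleData", ArrayType(sale_struct)),
            StructField("SiteId", LongType()),
            StructField("SiteReference", StringType()),
        ])),
        StructField("EventProcessedUtcTime", StringType()),
    ])

    # One made-up row, just to have something to test against
    data = [("evt-001",
             (42, [(1, "Small", 9001, "P-001")], 7, "SITE-A"),
             "2023-01-01T00:00:00Z")]

    df = spark.createDataFrame(data, schema)
    df.printSchema()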
 

Solution

  • You can use withField to replace a child field of a struct column, transform to modify each element of an array column, and dropFields to drop a field from a struct. These Column methods require Spark 3.1 or later. Putting them together:

    from pyspark.sql import functions as F

    # Rebuild EventPayload, replacing ProductSaleData with a copy of the array
    # in which each struct element has ProductEntityCode dropped.
    df = df.withColumn(
        "EventPayload",
        F.col("EventPayload").withField(
            "ProductSaleData",
            F.transform("EventPayload.ProductSaleData",
                        lambda x: x.dropFields("ProductEntityCode")),
        ),
    )
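
    As a quick sanity check (using the DataFrame above, or any DataFrame with this schema), you can print the schema of the transformed column and confirm the field is gone:

    # The element struct should now contain only PortionId, PortionName
    # and ProductReference.
    df.select("EventPayload.ProductSaleData").printSchema()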