Team,
I am a beginner in Pyspark programming
I am trying to remove the "ProductEntityCode" from the array column present within the schema below. This column is nested within the ProductSaleData which is an array and this array is nested within the EventPayload which is a Struct.
Can anyone tell me how to remove this nestled ProductEntityCode while keeping the other columns intact. Will be really helpful.
root
|-- EventId: string (nullable = true)
|-- EventPayload: struct (nullable = true)
| |-- AccountId: long (nullable = true)
| |-- ProductSaleData: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- PortionId: long (nullable = true)
| | | |-- PortionName: string (nullable = true)
| | | |-- **ProductEntityCode: long (nullable = true)**
| | | |-- ProductReference: string (nullable = true)
| | |
| |-- SiteId: long (nullable = true)
| |-- SiteReference: string (nullable = true)
|
|-- EventProcessedUtcTime: string (nullable = true)
You can use withField
to update a child field of a column and use transform
to modify the element in an array. To drop a field in a struct, use dropFields
.
df = (df.withColumn("EventPayload",
(F.col("EventPayload")
.withField("ProductSaleData",
F.transform("EventPayload.ProductSaleData",
lambda x: x.dropFields("ProductEntityCode"))))
)
)