apache-spark, pyspark

Drop a column in a nested structure


I have the following schema:

|--items : array
   |-- element : struct
       |-- id : long
       |-- value : double        
       |-- stock : array
           |-- element : string

I'm trying to drop the stock column from my schema; my desired output is:

|--items : array
   |-- element : struct
       |-- id : long
       |-- value : double        

I've tried to drop the column using the following code:

df = df.withColumn('items', F.col('items').dropFields('stock'))

This gives me the following error:

Parameter 1 requires "STRUCT" type, however "items" has type "Array<Struct…"

I also tried

df = df.withColumn("items", F.col("items").cast(cast))

Note: my cast here is the same schema without the stock field, but I got the following error:

Cannot resolve "items" due to data type mismatch: cannot cast "source schema…" to "desired schema…"

So my question is: how can I drop the stock column to get my desired output?


Solution

  • dropFields requires a column of struct type, but here the column contains an array of structs. The solution is to apply the transform function to each struct inside the array and drop the corresponding field:

    import pyspark.sql.functions as F  # transform and dropFields require Spark 3.1+

    df = df.withColumn('items', F.transform('items', lambda x: x.dropFields('stock')))
    

    root
     |-- items: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: long (nullable = true)
     |    |    |-- value: double (nullable = true)
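
  • A minimal end-to-end sketch of the above (the sample rows are made up purely for illustration; Spark 3.1+ is assumed, since both Column.dropFields and pyspark.sql.functions.transform were added in 3.1):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical sample data matching the schema in the question
    df = spark.createDataFrame(
        [([(1, 2.0, ["a", "b"]), (2, 3.5, ["c"])],)],
        "items array<struct<id: long, value: double, stock: array<string>>>",
    )

    # Drop 'stock' from every element of the array
    df = df.withColumn('items', F.transform('items', lambda x: x.dropFields('stock')))
    df.printSchema()  # 'stock' is gone from each element struct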
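
  • If you are on a Spark version older than 3.1, neither of those calls is available, but the SQL higher-order transform function has existed since 2.4. A sketch of an equivalent rewrite (assuming the field names from the schema above) rebuilds each struct with only the fields you want to keep:

    # Spark 2.4-3.0 alternative: rebuild each element struct without 'stock'
    df = df.withColumn(
        'items',
        F.expr("transform(items, x -> struct(x.id as id, x.value as value))")
    )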