Tags: apache-spark, pyspark

Apply a lambda function to a nested field in Spark


I have the following schema:

 |-- items: array
 |    |-- element: struct
 |    |    |-- id: long
 |    |    |-- value: double
 |    |    |-- stock: struct
 |    |    |    |-- centerList: array
 |    |    |    |    |-- element: struct
 |    |    |    |    |    |-- center: struct

I can iterate over items using

df = df.withColumn('items', F.transform('items', lambda x: x.dropFields('value')))
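
For anyone who wants to reproduce this, a minimal DataFrame with roughly this shape could look like the sketch below. The name and qty leaf fields and the values are made-up placeholders; only the nesting mirrors the schema above.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Placeholder data: only the nesting mirrors the schema above; 'name' and
    # 'qty' stand in for leaf fields that are not shown there.
    df = spark.createDataFrame(
        [([{'id': 1,
            'value': 9.99,
            'stock': {'centerList': [{'center': {'name': 'A'}, 'qty': 3}]}}],)],
        'items array<struct<id: long, value: double, stock: struct<'
        'centerList: array<struct<center: struct<name: string>, qty: long>>>>>',
    )

    df = df.withColumn('items', F.transform('items', lambda x: x.dropFields('value')))
    df.printSchema()  # 'value' no longer appears under items.element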

But how can I iterate over centerList? I’ve tried:

df = df.withColumn('items', F.transform('items', lambda x: x.withField('stock.centerList', F.transform('centerList', lambda y: y.dropFields('center')))))

But when I run my code, I get the following error:

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `centerList` cannot be resolved.

Solution

  • Your logic is correct, but the inner transformation needs to reference the centerList of the current item. F.transform('centerList', ...) asks Spark to resolve centerList as a top-level column of the DataFrame, which is why it is reported as unresolved; reference it through the lambda variable instead, e.g. x['stock']['centerList']:

    from pyspark.sql import functions as F

    def transform(col):
        # For each item, replace stock.centerList with a copy of that item's
        # own centerList whose elements have the 'center' field dropped.
        return F.transform(
            col,
            lambda x: x.withField(
                'stock.centerList',
                F.transform(
                    x['stock']['centerList'],  # resolved relative to the current item
                    lambda y: y.dropFields('center')
                )
            )
        )

    df = df.withColumn('items', transform('items'))
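
A quick way to check the result against the placeholder DataFrame sketched in the question is to print the schema: center should no longer appear under centerList, while the remaining fields (the assumed qty) are untouched.

    df.printSchema()
    # Abbreviated output (nullability flags omitted); 'value' is absent only
    # because the question's earlier dropFields('value') transform was applied.
    # root
    #  |-- items: array
    #  |    |-- element: struct
    #  |    |    |-- id: long
    #  |    |    |-- stock: struct
    #  |    |    |    |-- centerList: array
    #  |    |    |    |    |-- element: struct
    #  |    |    |    |    |    |-- qty: long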