pyspark, azure-databricks

Spark DataFrame: applying a lambda on the DataFrame directly


I see so many examples that apply a lambda via rdd.map.
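
For context, that pattern typically looks something like this (a minimal runnable sketch; the data and column names are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"col1": 1, "col2": 2}])
# The usual approach: drop down to the RDD API and map a lambda over the rows.
summed = df.rdd.map(lambda row: row["col1"] + row["col2"])
print(summed.collect())  # [3]

I'm just wondering if we can skip the RDD step and do something like the following on the DataFrame directly: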

df.withColumn('newcol',(lambda x: x['col1'] + x['col2'])).show()

Solution

  • You'll have to wrap it in a UDF and pass in the columns you want your lambda applied to.

    Example:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        data = [{"a": 1, "b": 2}]
        df = spark.createDataFrame(data)
        # Wrap the lambda in a UDF, then call it with the input column names.
        # Without an explicit return type a UDF returns strings, so "long"
        # is passed here to keep the result column numeric.
        df.withColumn("c", F.udf(lambda x, y: x + y, "long")("a", "b")).show()
    

    Result:

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+
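
  • Worth noting: for simple arithmetic like this you don't need a UDF at all. A built-in column expression is evaluated natively by Spark and avoids the Python serialization round-trip a UDF incurs. A minimal sketch, reusing the same toy data as above:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame([{"a": 1, "b": 2}])
        # Column arithmetic runs inside Spark's engine directly,
        # with no Python UDF round-trip.
        df.withColumn("c", F.col("a") + F.col("b")).show()

    This prints the same result table as above; built-in expressions are the idiomatic choice whenever the logic can be written with native column functions.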