I see many examples that apply a lambda through rdd.map.
I just wonder whether we can do something like the following on a DataFrame:
df.withColumn('newcol',(lambda x: x['col1'] + x['col2'])).show()
withColumn expects a Column expression, not a plain Python function, so you'll have to wrap the lambda in a UDF and pass it the columns you want it applied to.
Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [{"a": 1, "b": 2}]
    df = spark.createDataFrame(data)
    df.withColumn("c", F.udf(lambda x, y: x + y)("a", "b")).show()
Result:
+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
+---+---+---+
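A variation on the example above, as a rough sketch rather than the only way to do it: F.udf falls back to StringType when no return type is given, so it may be worth declaring one. For plain arithmetic like this, a built-in column expression should also work without any UDF. The names add_cols and demo are made up for illustration, and the Spark imports sit inside demo only so the plain function can be tried without Spark installed.

```python
def add_cols(x, y):
    """Plain Python function; Spark calls it once per row when wrapped in a UDF."""
    return x + y


def demo():
    # Imported here so add_cols stays usable on a machine without PySpark.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    from pyspark.sql.types import LongType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 2)], ["a", "b"])

    # Same UDF approach as above, with an explicit return type
    # (F.udf defaults to StringType when none is given).
    add_udf = F.udf(add_cols, LongType())
    df.withColumn("c", add_udf("a", "b")).show()

    # Alternative for simple arithmetic: a built-in column expression, no UDF.
    df.withColumn("c", F.col("a") + F.col("b")).show()
```

Calling demo() with PySpark installed should print the DataFrame twice, once per approach. The column-expression form is usually the better choice when it is available, since a Python UDF has to ship every row to a Python worker.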