pyspark, azure-databricks

Spark DataFrame: applying a lambda on the DataFrame directly


I see so many examples that apply a lambda via rdd.map.
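
For context, that pattern typically looks something like this (a minimal runnable sketch; the data and column names are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([{"col1": 1, "col2": 2}])
# The usual approach: drop down to the RDD API and map a lambda over the rows.
summed = df.rdd.map(lambda row: row["col1"] + row["col2"])
print(summed.collect())  # [3]

I'm just wondering if we can skip the RDD step and do something like the following on the DataFrame directly: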

df.withColumn('newcol',(lambda x: x['col1'] + x['col2'])).show()

Solution

  • You'll have to wrap it in a UDF and pass in the columns you want your lambda applied to.

    Example:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        data = [{"a": 1, "b": 2}]
        df = spark.createDataFrame(data)
        # Wrap the lambda in a UDF, then call it with the input column names.
        # Without an explicit return type a UDF returns strings, so "long"
        # is passed here to keep the result column numeric.
        df.withColumn("c", F.udf(lambda x, y: x + y, "long")("a", "b")).show()
    

    Result:

    +---+---+---+
    |  a|  b|  c|
    +---+---+---+
    |  1|  2|  3|
    +---+---+---+
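
  • Worth noting: for simple arithmetic like this you don't need a UDF at all. A built-in column expression is evaluated natively by Spark and avoids the Python serialization round-trip a UDF incurs. A minimal sketch, reusing the same toy data as above:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F
    
    if __name__ == "__main__":
        spark = SparkSession.builder.getOrCreate()
        df = spark.createDataFrame([{"a": 1, "b": 2}])
        # Column arithmetic runs inside Spark's engine directly,
        # with no Python UDF round-trip.
        df.withColumn("c", F.col("a") + F.col("b")).show()

    This prints the same result table as above; built-in expressions are the idiomatic choice whenever the logic can be written with native column functions.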