I am trying to evaluate the following condition to add a column to a Spark DataFrame.
from pyspark.sql.functions import pandas_udf, col
from pyspark.sql.types import StringType

@pandas_udf(returnType=StringType())
def test(email, headers):
    if email is not None:
        return email
    else:
        return headers.str.get("default")
What's the best way to check if email is null? I have tried several options but nothing works.
res = df.withColumn("out", test(col("email"), col("headers")))
Even if col("email") is null, the else condition is not evaluated.
You'll be better off using built-in Spark SQL functions than defining a UDF; it will be much more performant.
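For what it's worth, the reason your else branch never runs is that a pandas UDF receives email and headers as whole pandas Series, not as individual values, so email is not None is always true and the per-row nulls are never checked. If you do need to keep a UDF, a minimal sketch like this should work (I'm assuming the plain headers value is the fallback you want, as in the example below):

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(returnType=StringType())
def test(email: pd.Series, headers: pd.Series) -> pd.Series:
    # combine_first fills the null entries of email with the
    # headers value at the same position
    return email.combine_first(headers)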
With the built-in functions, something like this works:
import pyspark.sql.functions as F
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
[
("[email protected]", "header1"),
("[email protected]", "header2"),
(None, "header3"),
("[email protected]", "header4"),
(None, "header5"),
],
["email", "headers"],
)
df.withColumn(
"out", F.when(F.col("email").isNull(), F.col("headers")).otherwise(F.col("email"))
).show()
+------------------+-------+------------------+
| email|headers| out|
+------------------+-------+------------------+
|[email protected]|header1|[email protected]|
|[email protected]|header2|[email protected]|
| null|header3| header3|
|[email protected]|header4|[email protected]|
| null|header5| header5|
+------------------+-------+------------------+
The when function gives you the "if" functionality you're looking for.
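Alternatively, since you just want the first non-null of two columns, F.coalesce expresses the same logic in a single call:

# coalesce picks the first non-null value among its arguments,
# equivalent to the when/otherwise version above
df.withColumn("out", F.coalesce(F.col("email"), F.col("headers"))).show()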