When passing inputs into a pandas_udf in PySpark, sometimes you use col("name") and sometimes you use "name" directly. Is there a difference? Also, can somebody point me to the exact place in the documentation that says both usages are allowed? I know that both approaches work, but I am struggling to convince myself, hence the need to find the corresponding documentation.
thank you
(Example extracted from one of the Databricks tutorials.)
import mlflow.sklearn
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("double")
def predict(*args: pd.Series) -> pd.Series:
    # "run" is the MLflow run created earlier in the tutorial notebook
    model_path = f"runs:/{run.info.run_id}/model"
    model = mlflow.sklearn.load_model(model_path)  # Load model
    pdf = pd.concat(args, axis=1)
    return pd.Series(model.predict(pdf))

# Column names (plain strings) are unpacked and passed directly to the UDF
prediction_df = spark_df.withColumn("prediction", predict(*spark_df.columns))
display(prediction_df)
Consider this function:

def add(A, B):
    return A + B
In Python, irrespective of the values assigned to variables A and B, the function will be executed. It may encounter failures if the input objects do not match the expected types, but the function itself will be invoked regardless.
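For example, a quick sketch of what that means at runtime:

# Same definition as above
def add(A, B):
    return A + B

add(1, 2)      # 3 -- works with integers
add("a", "b")  # "ab" -- the same body runs with strings
add(1, None)   # raises TypeError, but only once the body actually executes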
In Scala, the language underlying Spark, functions with the same name, like the following:
def add(x: Int, y: Int): Int = x + y
def add(x: Double, y: Double): Double = x + y
exist simultaneously as distinct functions. When using Spark functions, you are actually calling Scala functions. Some are defined with a string as input (representing the column name), while others are defined with a Column object as input (F.col("col_name")). In most cases, both versions are defined, so you can use either interchangeably; they are functionally identical.
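For example, both forms are accepted by the built-in Spark SQL functions; here is a minimal, self-contained illustration (the DataFrame is made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ["a", "b"])

# Both calls resolve to the same column; the string is converted to a Column internally
df.select(F.sum("a")).show()
df.select(F.sum(F.col("a"))).show()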
The form function("column_name") is designed for non-developers, especially data scientists. On the other hand, function(col("column_name")) is more object-oriented and intended for developers.
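Applied to the pandas_udf in your question, both calls below are equivalent (the column names "f1" and "f2" are hypothetical stand-ins for whatever spark_df actually contains):

from pyspark.sql.functions import col

# Equivalent ways to call the UDF, assuming spark_df has columns "f1" and "f2"
prediction_df = spark_df.withColumn("prediction", predict("f1", "f2"))
prediction_df = spark_df.withColumn("prediction", predict(col("f1"), col("f2")))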