Tags: apache-spark, pyspark, user-defined-functions, rdd

Apply different functions to many columns of a pyspark dataframe


I have a pyspark dataframe with a few columns

col1    col2    col3
---------------------
1.0     2.1     3.2
3.2     4.2     5.1

and I would like to apply three functions f1(x), f2(x), f3(x), each to the corresponding column of the dataframe, so that I get

col1        col2        col3
-------------------------------
f1(1.0)     f2(2.1)     f3(3.2)
f1(3.2)     f2(4.2)     f3(5.1)

I am trying to avoid defining a UDF for each column. My idea is to build an RDD from each column, apply the corresponding function to it (perhaps zipping with an index, which I could also add to the original dataset), and then join the results back to the original dataframe.

Is this a viable solution, or is there a better way to do it?
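
For concreteness, a rough sketch of that RDD route might look like the following (assuming f1, f2 and f3 are plain Python functions and spark is the active SparkSession):

indexed = df.rdd.zipWithIndex().map(lambda t: (t[1], t[0]))   # (idx, Row)

# one transformed (idx, value) RDD per column
c1 = indexed.mapValues(lambda r: f1(r["col1"]))
c2 = indexed.mapValues(lambda r: f2(r["col2"]))
c3 = indexed.mapValues(lambda r: f3(r["col3"]))

# join the pieces back together on the index and rebuild a dataframe
joined = c1.join(c2).join(c3)                                 # (idx, ((v1, v2), v3))
df2 = spark.createDataFrame(
    joined.map(lambda t: (t[1][0][0], t[1][0][1], t[1][1])),
    ["col1", "col2", "col3"],
)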

UPDATE: following @Andre' Perez's suggestion, I could define a UDF for each column and apply it with Spark SQL, or alternatively:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

f1_udf = F.udf(lambda x: float(np.sin(x)), FloatType())
f2_udf = F.udf(lambda x: float(np.cos(x)), FloatType())
f3_udf = F.udf(lambda x: float(np.tan(x)), FloatType())


df = df.withColumn("col1", f1_udf("col1"))
df = df.withColumn("col2", f2_udf("col2"))
df = df.withColumn("col3", f3_udf("col3"))
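
As a side note, the three withColumn calls could also be driven from a single mapping of column names to functions; a minimal sketch, assuming the same numpy-based functions as above:

import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.types import FloatType

# map each column to the function that should transform it
funcs = {"col1": np.sin, "col2": np.cos, "col3": np.tan}

# build one UDF per column and apply them all in a single select,
# keeping the original column names via alias
df = df.select([
    F.udf(lambda x, f=f: float(f(x)), FloatType())(c).alias(c)
    for c, f in funcs.items()
])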

Solution

  • Maybe it is better to register those functions as UDFs (even though you said you don't want to follow this approach).

    spark.udf.register("func1", f1)
    spark.udf.register("func2", f2)
    spark.udf.register("func3", f3)
    

    I would then register the DataFrame as a temporary view and run a Spark SQL query on it with the registered functions.

    df.createOrReplaceTempView("dataframe")
    df2 = spark.sql("select func1(col1), func2(col2), func3(col3) from dataframe")
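
    Note that the query as written will name the output columns after the expressions (e.g. func1(col1)); if you want to keep the original names, you can alias them in the query, for example:

    df2 = spark.sql(
        "select func1(col1) as col1, func2(col2) as col2, func3(col3) as col3 "
        "from dataframe"
    )

    Also, when registering plain Python functions this way, the return type defaults to a string type, so you may want to pass it explicitly, e.g. spark.udf.register("func1", f1, FloatType()).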