Tags: apache-spark, pyspark

PySpark performance of using Python UDF vs Pandas UDF


My understanding is that a Pandas UDF uses Arrow to reduce data serialization overhead and also supports vectorized computation, so a Pandas UDF should outperform a Python UDF. But the code snippet below shows the opposite. Why is that? Or did I do something wrong?

from time import perf_counter

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf, rand, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("TEST").getOrCreate()

# 1M rows with a random double column 'v'
sdf = spark.range(0, 1000000).withColumn('v', rand())

@pandas_udf(DoubleType())
def pandas_plus_one(s: pd.Series) -> pd.Series:
    # Vectorized: receives a whole Arrow batch as a pandas Series
    return s + 1

@udf(DoubleType())
def plus_one(num):
    # Row-at-a-time: called once per row with a plain Python value
    return num + 1

# Pandas UDF
res_pdf = sdf.select(pandas_plus_one(col("v")))
st = perf_counter()
for _ in range(10):
    res_pdf.show()
print(f"Pandas UDF Time: {(perf_counter() - st) * 1000} ms")

# Python UDF
res = sdf.select(plus_one(col("v")))
st = perf_counter()
for _ in range(10):
    res.show()
print(f"Python UDF Time: {(perf_counter() - st) * 1000} ms")

Solution

  • To answer my own question: show() by default displays only the first 20 rows, and Spark pushes that limit down into the plan. So only 20 of the 1M rows are actually passed to and computed by the UDF. With that little work per invocation, the fixed setup overhead dominates, and the setup cost of a Pandas UDF (the Python worker plus the Arrow serialization machinery) is much higher than that of a plain Python UDF.

    This is a somewhat strange and non-intuitive optimization in Spark. As a user, I would expect all 1M rows to be passed to the UDF and computed, with only 20 results displayed. To benchmark the UDFs themselves, force Spark to evaluate every row instead of relying on show(); see the sketch below.
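
    A minimal sketch of a fairer benchmark, assuming Spark 3.x: writing to the built-in "noop" sink forces every row through the UDF without materializing any output. The bench helper and the run count are illustrative choices, not part of the original post.

    from time import perf_counter

    def bench(df, label, n_runs=3):
        # The noop sink consumes every row without writing anything,
        # so the UDF is actually executed over all 1M rows.
        st = perf_counter()
        for _ in range(n_runs):
            df.write.format("noop").mode("overwrite").save()
        print(f"{label}: {(perf_counter() - st) * 1000 / n_runs:.0f} ms per run")

    bench(sdf.select(pandas_plus_one(col("v"))), "Pandas UDF")
    bench(sdf.select(plus_one(col("v"))), "Python UDF")

    With the full 1M rows flowing through each UDF, the Pandas UDF's vectorized, Arrow-backed execution should typically come out ahead.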