I'm reading the book *Spark: The Definitive Guide – Big Data Processing Made Simple*, which came out in 2018, and it's now 2023. The book says that UDFs written in Python are inefficient, and the same goes for running Python code on RDDs. Is that still true?
It's old knowledge, but it still applies:
UDFs are slower in general because Catalyst cannot optimize them: they are black boxes to the optimizer. See https://www.codemotion.com/magazine/ai-ml/big-data/light-up-the-spark-in-catalyst-by-avoiding-udf/
As for Python vs. Scala UDFs: yes, Python is slower, because every row has to be serialized out to a Python worker process and back. https://medium.com/quantumblack/spark-udf-deep-insights-in-performance-f0a95a4d8c62 gives good insight into this. I mainly work in Scala, so this is not my key area, but that post summarizes it well.
As for RDDs: they cannot be optimized by Catalyst at all, whether you use PySpark or Scala. You should be using DataFrames or Datasets instead.