
Hive UDF vs. PySpark UDF in Cross Join


I need to run a UDF on a cross-joined dataset in PySpark. I think I can do this in two steps: (1) do the cross join first, then (2) run the UDF on the result of the first step, as in the sketch below.
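For illustration, a minimal sketch of the two-step approach (the DataFrames `df_a`, `df_b` and the `combine` logic are placeholders, not my real data):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder inputs, just for the example
df_a = spark.createDataFrame([("a",), ("b",)], ["x"])
df_b = spark.createDataFrame([(1,), (2,)], ["y"])

# Step 1: cross join
crossed = df_a.crossJoin(df_b)

# Step 2: run a Python UDF on the crossed result
combine = udf(lambda x, y: f"{x}-{y}", StringType())
result = crossed.withColumn("z", combine("x", "y"))
result.show()
```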

In Hive, this can be done in one step by running the UDF together with the CROSS JOIN. Does Hive perform this in two steps internally, like PySpark (assuming my understanding is correct)? Or is there a way to do the same thing in PySpark?


Solution

  • The core of Spark is implemented in Java and Scala. Whether you use the Spark Scala API, Spark SQL, or PySpark, the main processing runs inside the JVM.

    If you use non-native UDFs such as Python UDFs, extra steps are required internally: the UDF's input data is serialized, moved to the Python worker process, and deserialized, and the UDF runs in Python. The results are then moved back to the JVM as well. As far as I know, there is no way to avoid this.
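    As for the second part of the question: in PySpark you can still express the cross join and the UDF in a single chained statement, and Spark plans both as one query; the JVM-to-Python round trip described above happens either way. A minimal sketch, reusing the hypothetical `df_a`, `df_b`, and `combine` from the question:

    ```python
    # Cross join and UDF in one chained expression -- planned as a single
    # query, but the Python UDF still incurs the JVM <-> Python round trip.
    result = df_a.crossJoin(df_b).withColumn("z", combine("x", "y"))

    # The physical plan contains a BatchEvalPython node for the Python UDF,
    # which is where the serialize/transfer/deserialize steps occur.
    result.explain()
    ```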