Tags: python, apache-spark, pyspark, pycharm

Spark/pyspark on same version but "py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction does not exist"


I have a properly synced pyspark client / Spark installation: both versions are 3.3.1 (shown below). The full exception message is:

py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist

This has been identified in another Stack Overflow post as most likely due to a version mismatch between the pyspark invoker/caller and the Spark backend. I agree that seems the likely cause, but I have carefully verified that both sides of the equation are equal:

pyspark and Spark are the same version:

Python 3.10.13 (main, Aug 24 2023, 22:48:59) [Clang 14.0.3 (clang-1403.0.22.14.1)]

In [1]: import pyspark

In [2]: print(f"PySpark version: {pyspark.__version__}")
PySpark version: 3.3.1

Spark was installed by downloading the 3.3.1 .tgz directly from the Apache site and untarring it. SPARK_HOME was pointed at that directory and $SPARK_HOME/bin was added to the PATH.
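For what it's worth, this is roughly how I sanity-check that setup from within Python (a minimal sketch; the directory is simply wherever the .tgz was unpacked):

import os

# Sanity check of the manual install: SPARK_HOME should point at the unpacked
# 3.3.1 distribution and its bin directory should be on the PATH.
spark_home = os.environ.get("SPARK_HOME")
print(f"SPARK_HOME = {spark_home}")

bin_dir = os.path.join(spark_home, "bin") if spark_home else None
print(f"{bin_dir} on PATH:", bin_dir in os.environ.get("PATH", "").split(os.pathsep))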

$ spark-shell --version

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Inside the Python script, the version has been verified as well:

pyspark version: 3.3.1
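For context, a rough sketch of the shape of the script (the UDF part is illustrative, not the actual code, but it is the kind of call that constructs a PythonFunction on the JVM side):

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

print(f"pyspark version: {pyspark.__version__}")   # prints 3.3.1 as above

spark = SparkSession.builder.master("local[*]").appName("repro").getOrCreate()

# The Py4JException surfaces where a Python function is shipped to the JVM,
# e.g. applying a UDF or running an RDD transformation.
double = udf(lambda x: x * 2, IntegerType())
spark.range(5).select(double("id")).show()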

But the script blows up with a pyspark/Spark error:

An error occurred while calling None.org.apache.spark.api.python.PythonFunction

py4j.Py4JException: Constructor org.apache.spark.api.python.PythonFunction([class [B, class java.util.HashMap, class java.util.ArrayList, class java.lang.String, class java.lang.String, class java.util.ArrayList, class org.apache.spark.api.python.PythonAccumulatorV2]) does not exist at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:180)

So, what else might be going on here? Is there some way, which I'm not seeing, in which the versions of Spark/pyspark might be out of sync?
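In case it helps diagnose, here is a small check comparing the version the Python package reports with the version the JVM backend reports once a session is up, plus which pyspark install is actually being imported (a diagnostic sketch):

import os
import pyspark
from pyspark.sql import SparkSession

print("pyspark.__version__:", pyspark.__version__)           # client-side package
print("pyspark.__file__   :", pyspark.__file__)              # which install is imported
print("SPARK_HOME         :", os.environ.get("SPARK_HOME"))  # backend the launcher uses

spark = SparkSession.builder.master("local[*]").getOrCreate()
print("spark.version      :", spark.version)                 # what the JVM actually runs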


Solution

  • PyCharm situation: it looks like I had not restarted the IDE after twiddling between Spark versions. It had remembered an earlier default, the Homebrew install of 3.5.0 (see the check below).
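That mismatch is consistent with the exception: a 3.3.1 pyspark client invokes the PythonFunction constructor with the 3.3.x argument list, which a 3.5.0 backend no longer exposes, so py4j reports that the constructor does not exist. After restarting PyCharm, a quick check confirms both sides agree (a minimal sketch):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
# Both sides should now report 3.3.1.
assert pyspark.__version__ == spark.version, (pyspark.__version__, spark.version)
print("client and backend agree:", spark.version)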