Tags: python, java, docker, apache-spark, pyspark

PySpark exception: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number


Yes, I know I'm going to be told it's a duplicate, but it's not.

When I try to connect to my Apache Spark cluster running in a Bitnami/Spark container, I always get this error:

Error: Could not find or load main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.SparkSubmit
Traceback (most recent call last):
  File "d:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\src\main.py", line 40, in <module>
    spark = SparkSession.builder.master(f'spark://spark-master:7077').appName("SimpleApp").getOrCreate()
  File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\sql\session.py", line 497, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\context.py", line 515, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\context.py", line 201, in __init__
    SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
  File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\context.py", line 436, in _ensure_initialized
    SparkContext._gateway = gateway or launch_gateway(conf)
  File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\java_gateway.py", line 107, in launch_gateway
    raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
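
From what I understand, the ClassNotFoundException for org.apache.spark.deploy.SparkSubmit means the java process that PySpark spawned never saw the Spark jars at all. Here is a quick check of what the pip-installed PySpark would actually launch (it only inspects the package layout, nothing else):

import os
import pyspark

# pip installs a full Spark distribution inside the pyspark package, so
# bin/spark-submit and the jars/ directory should both exist there.
spark_home = os.path.dirname(pyspark.__file__)
print("Spark home used by PySpark:", spark_home)
print("spark-submit present:", os.path.exists(os.path.join(spark_home, "bin", "spark-submit")))
print("jars/ present:", os.path.isdir(os.path.join(spark_home, "jars")))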

Following suggestions I've seen in several places on the internet:

  • At first, I tried with JDK 8, but it didn't work.

  • I uninstalled Java, then installed JDK 11.

  • I also added JAVA_HOME to my PATH, and when I type java -version, I get (see also the sanity check after this list):

    openjdk version "11.0.22" 2024-01-16
    OpenJDK Runtime Environment Temurin-11.0.22+7 (build 11.0.22+7)
    OpenJDK 64-Bit Server VM Temurin-11.0.22+7 (build 11.0.22+7, mixed mode)

  • I was careful to align my PySpark version with that of my Bitnami Spark cluster: both are 3.5.0.

  • My Python version is 3.12.1.

  • The ports on my Spark master are 8080 for the web UI and 7077 for Spark itself.
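
To rule out an environment mismatch, here is the sanity check I use to confirm that the interpreter running the script actually sees JAVA_HOME and a java binary (standard library only, no assumptions beyond a working Java install):

import os
import shutil
import subprocess

# PySpark starts the JVM via spark-submit, which needs JAVA_HOME (or a `java`
# resolvable on PATH), so both should be visible from the active environment.
print("JAVA_HOME    =", os.environ.get("JAVA_HOME"))
print("java on PATH =", shutil.which("java"))

# Run the exact `java` this interpreter's environment would launch.
subprocess.run(["java", "-version"], check=True)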

Here's the code I'm trying to run:

from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

from pyspark.sql.types import IntegerType
from pyspark.sql.types import StructField
from pyspark.sql.types import StructType

if __name__ == "__main__":
    
    conf = SparkConf()
    conf.setAll(
        [
            ("spark.master", "spark://spark-master:7077"),
            ("spark.driver.host", "local[*]"),
            ("spark.submit.deployMode", "client"),
            ("spark.driver.bindAddress", "0.0.0.0"),
            ("spark.app.name", "HelloWorld"),
        ]
    )
    
    spark = SparkSession.builder.config(conf=conf).getOrCreate()


    df = spark.createDataFrame([("Hello World",)], ["greeting"])
    df.show()

    spark.stop()

In the PySpark config, I tried the following values for "spark.master":

  • spark://127.0.1.1:7077
  • spark://localhost:7077
  • spark://spark-master:7077
  • spark://host.docker.internal:7077

And I have also tried these for "spark.driver.host" (see the sketch after this list):

  • The IP of my machine on my internal network
  • Container IP
  • 127.0.0.1
  • localhost
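
For reference, what's usually recommended when the driver runs on the host and the master runs in Docker is that spark.driver.host be an address the containers can reach back over the network, so "localhost" and container IPs generally can't work, and "local[*]" is a master URL, not a hostname. A sketch of that setup (the LAN-IP detection and the localhost master URL are assumptions about this particular Docker setup):

import socket

from pyspark.sql import SparkSession

# Assumption: the master's port 7077 is published to the host, and the
# executors in Docker must reach the driver back via the host's LAN IP.
host_ip = socket.gethostbyname(socket.gethostname())  # may need a fixed IP instead

spark = (
    SparkSession.builder
    .master("spark://localhost:7077")
    .config("spark.driver.host", host_ip)
    .config("spark.driver.bindAddress", "0.0.0.0")
    .appName("HelloWorld")
    .getOrCreate()
)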

I really hope you can help me. I used this video to set everything up: https://www.youtube.com/watch?v=luiJttJVeBA.


Solution

  • Forget it, I've finally found the problem: it was a PATH issue caused by my venv. When I echoed $JAVA_HOME inside my Python environment, the variable came back correctly, but for some reason the script couldn't use it. Everything worked once I launched the script directly from my terminal, outside the venv.
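
If you need to stay inside the venv, a workaround that often helps with this kind of PATH problem is to set JAVA_HOME explicitly at the top of the script, before the SparkSession is created, since PySpark reads it from the process environment when launching the gateway. A minimal sketch; the JDK path is an assumption and must be adjusted to your own install:

import os

# Assumption: Temurin JDK 11 installed at this path; point it at your own JDK.
os.environ["JAVA_HOME"] = r"C:\Program Files\Eclipse Adoptium\jdk-11.0.22.7-hotspot"
os.environ["PATH"] = (
    os.path.join(os.environ["JAVA_HOME"], "bin") + os.pathsep + os.environ["PATH"]
)

from pyspark.sql import SparkSession  # import after fixing the environment

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")
    .appName("SimpleApp")
    .getOrCreate()
)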