Yes, I know I'm going to be told this is a duplicate, but it's not.
Whenever I try to connect to my Apache Spark cluster running in a Bitnami/Spark container, I get this error:
Error: unable to find or load main class org.apache.spark.deploy.SparkSubmit
Caused by: java.lang.ClassNotFoundException: org.apache.spark.deploy.SparkSubmit
Traceback (most recent call last):
File "d:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\src\main.py", line 40, in <module>
spark = SparkSession.builder.master(f'spark://spark-master:7077').appName("SimpleApp").getOrCreate()
File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\sql\session.py", line 497, in getOrCreate
sc = SparkContext.getOrCreate(sparkConf)
File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\context.py", line 515, in getOrCreate
SparkContext(conf=conf or SparkConf())
File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\context.py", line 201, in __init__
SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\context.py", line 436, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway(conf)
File "D:\Users\azeve\Desktop\École\MT4\DEVOPS\RENDU_DEVOPS_MT4\.venv\lib\site-packages\pyspark\java_gateway.py", line 107, in launch_gateway
raise PySparkRuntimeError(
pyspark.errors.exceptions.base.PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
As suggested in several places online, I first tried JDK 8, but that didn't work, so I uninstalled Java and installed JDK 11.
I also added JAVA_HOME to my PATH, and when I type java -version, I get:
openjdk version "11.0.22" 2024-01-16
OpenJDK Runtime Environment Temurin-11.0.22+7 (build 11.0.22+7)
OpenJDK 64-Bit Server VM Temurin-11.0.22+7 (build 11.0.22+7, mixed mode)
I was careful to match my PySpark version (3.5.0) to that of my Bitnami Spark cluster.
My Python version is 3.12.1.
The ports on my Spark master are 8080 for the web UI and 7077 for Spark itself.
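Before blaming Spark itself, one thing worth verifying is that those two ports are actually reachable from the driver machine. This is a minimal sketch of such a check; the host name spark-master is the one from my setup, and port_open is just a throwaway helper, not part of any Spark API:

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Sanity-check that the container's published ports are reachable.
for port in (7077, 8080):
    print(port, "open" if port_open("spark-master", port) else "closed")
```

If 7077 shows as closed here, the problem is networking or port mapping, not the Spark config.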
Here's the code I'm trying to run:
from pyspark.sql import SparkSession
from pyspark.conf import SparkConf

if __name__ == "__main__":
    conf = SparkConf()
    conf.setAll(
        [
            ("spark.master", "spark://spark-master:7077"),
            ("spark.driver.host", "local[*]"),
            ("spark.submit.deployMode", "client"),
            ("spark.driver.bindAddress", "0.0.0.0"),
            ("spark.app.name", "HelloWorld"),
        ]
    )

    spark = SparkSession.builder.config(conf=conf).getOrCreate()
    df = spark.createDataFrame([("Hello World",)], ["greeting"])
    df.show()
    spark.stop()
In the PySpark config I also tried several other values for "spark.master" and for "spark.driver.host".
I really hope you can help me. I used this video to set everything up: https://www.youtube.com/watch?v=luiJttJVeBA.
Forget it, I've finally found the problem: it was a PATH issue caused by my venv. When I echoed $JAVA_HOME from inside my Python environment, the variable was set, but for some reason the script still couldn't use it. Everything sorted itself out when I launched the script directly from my terminal, without the venv.
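For anyone hitting the same [JAVA_GATEWAY_EXITED] error: this is the sanity check that would have saved me time. It's just a sketch (java_visible is my own helper, not a PySpark function) that prints what the running interpreter, venv or not, can actually see:

```python
import os
import shutil

def java_visible() -> bool:
    """Print and return whether a Java installation is visible
    to *this* Python process."""
    java_home = os.environ.get("JAVA_HOME")
    java_bin = shutil.which("java")
    print(f"JAVA_HOME    = {java_home!r}")
    print(f"java on PATH = {java_bin!r}")
    # PySpark's gateway launcher needs JAVA_HOME or `java` on the PATH
    # of the process that runs the script, not just of your shell.
    return bool(java_home) or java_bin is not None

java_visible()
```

Run this from the exact environment you launch the script in; if it prints None for both, the gateway will die before sending its port number, just like in my traceback.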