Tags: java, python-3.x, apache-spark, pyspark, cmd

Starting with PySpark and having problems with simple code


I'm new to PySpark and tried a simple piece of code like this:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('Read File')
sc = SparkContext.getOrCreate(conf=conf)

# Read the text file into an RDD of lines
rdd = sc.textFile('data1.txt')
print(rdd.collect())

# Split every line on spaces
rdd2 = rdd.map(lambda x: x.split(' '))
print(rdd2.collect())

but the rdd2.collect() call always gives me errors like:

ERROR Executor: Exception in task 1.0 in stage 2.0 (TID 5)/ 2]
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

I have these versions installed, all local, and I am running everything on Windows 10 with cmd.exe:

  • Python 3.12.1
  • Java 11.0.20
  • Spark 3.5.0
  • Hadoop 3.3.6

I have also declared all the environment variables: JAVA_HOME, SCALA_HOME, HADOOP_HOME, SPARK_HOME, PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON, the last two pointing to python.exe in the Python installation directory.
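For reference, a minimal sketch of what those last two variables amount to in code, assuming they should point at the same interpreter that runs the driver script (data1.txt is the file from the snippet above):

import os
import sys

from pyspark import SparkConf, SparkContext

# Equivalent to the PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON environment
# variables: point both the driver and the workers at this interpreter.
# This has to happen before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

conf = SparkConf().setAppName('Read File')
sc = SparkContext.getOrCreate(conf=conf)

rdd = sc.textFile('data1.txt')  # same input file as in the question
print(rdd.map(lambda x: x.split(' ')).collect())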

I have tried uninstalling and reinstalling everything, changing versions, and changing environment variables, but I do not know what else to try.


Solution

  • Finally I ended up using Docker with the dedicated PySpark Jupyter image "https://hub.docker.com/r/jupyter/pyspark-notebook", and it works correctly without problems. In case anyone else runs into this kind of problem, you can also follow this guide "https://medium.com/@suci/running-pyspark-on-jupyter-notebook-with-docker-602b18ac4494" and this one "https://subhamkharwal.medium.com/data-lakehouse-with-pyspark-setup-pyspark-docker-jupyter-lab-env-1261a8a55697".

    Thanks to these guides I was able to install Docker easily and load the "jupyter/pyspark-notebook" image correctly, so everything works as it should (a minimal sanity check is sketched below).
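For anyone following the same route, here is a minimal sketch of a sanity check that can be run in a notebook started from the jupyter/pyspark-notebook image (e.g. with docker run -p 8888:8888 jupyter/pyspark-notebook). It assumes data1.txt from the question has been uploaded into the notebook's working directory; the image already ships with PySpark, so no extra installation is needed.

from pyspark.sql import SparkSession

# Everything runs locally inside the container, so a plain local
# SparkSession is enough to verify that the Python workers start correctly.
spark = SparkSession.builder.appName('Read File').getOrCreate()
sc = spark.sparkContext

rdd = sc.textFile('data1.txt')  # assumes data1.txt sits next to the notebook
print(rdd.map(lambda line: line.split(' ')).collect())

spark.stop()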