Tags: python, apache-spark, pyspark, graphframes

How to solve the runtime error: graphframes not found


I am using the graphframes package in PySpark. The job ran normally for a while (the graphframes module imported and worked fine), but after some time it started failing with the error: "No module named 'graphframes'".

The error is intermittent: sometimes the job completes, sometimes it does not.

PySpark version: 2.2.1

graphframes version: 0.6

error:

19/06/05 02:22:17 ERROR Executor: Exception in task 641.3 in stage 216.0 (TID 123244)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 166, in main
   func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/worker.py", line 55, in read_command
    command = serializer._read_with_length(file)
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/appcom/spark-2.2.1/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/data/data08/nm-local-dir/usercache/hduser0011/appcache/application_1547810698423_82435/container_1547810698423_82435_02_000041/ares_detect.zip/ares_detect/task/communication_detect.py", line 11, in <module>
    from graphframes import GraphFrame
ModuleNotFoundError: No module named 'graphframes'

command:

spark-submit --master yarn-cluster \
        --name ad_com_detect_${app_arr[$i]}_${scenario_arr[$i]}_${txParameter_app_arr[$i]} \
        --executor-cores 4 \
        --num-executors 8 \
        --executor-memory 35g \
        --driver-memory 2g \
        --conf spark.sql.shuffle.partitions=800 \
        --conf spark.default.parallelism=1000 \
        --conf spark.yarn.executor.memoryOverhead=2048 \
        --conf spark.sql.execution.arrow.enabled=true \
        --jars org.scala-lang_scala-reflect-2.10.4.jar,\
org.slf4j_slf4j-api-1.7.7.jar,\
com.typesafe.scala-logging_scala-logging-api_2.10-2.1.2.jar,\
com.typesafe.scala-logging_scala-logging-slf4j_2.10-2.1.2.jar,\
graphframes-0.6.0-spark2.2-s_2.11.jar \
        --py-files ***.zip \
***/***/****.py  &

Does pyspark remove these jars when it runs out of memory?
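
One way to narrow this down is to check whether graphframes is actually importable on every executor, rather than only seeing the failure inside the real job. The snippet below is only a rough diagnostic sketch: it assumes a running SparkSession and uses more partitions than executors so that each executor is hit at least once.

import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def probe(_):
    # Attempt the import inside the executor's Python worker.
    try:
        import graphframes  # noqa: F401
        ok = True
    except ImportError:
        ok = False
    yield (socket.gethostname(), ok)

# 200 partitions should touch every executor at least once in this setup.
report = sc.parallelize(range(200), 200).mapPartitions(probe).distinct().collect()
for host, ok in sorted(report):
    print(host, "graphframes OK" if ok else "graphframes MISSING")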


Solution

  • Try adding the jar via the --packages option.

    spark-submit \
        --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11  \
          my_py_script.py
    

    It also works with both parameters at the same time:

    spark-submit \
        --packages graphframes:graphframes:0.7.0-spark2.4-s_2.11  \
        --jars path_to_your_jars/graphframes-0.7.0-spark2.4-s_2.11.jar \
        my_py_script.py
    

    This solved the issue for me.

    In general there are four ways to add files to a Spark job; each is explained in the spark-submit --help output below (a sketch of shipping the graphframes Python package explicitly follows the list):

    --jars JARS            Comma-separated list of jars to include on the driver and executor classpaths.
    
    --packages             Comma-separated list of maven coordinates of jars to include on the driver and executor classpaths. Will search the local maven repo, then maven central and any additional remote repositories given by --repositories.
    
    --py-files PY_FILES    Comma-separated list of .zip, .egg, or .py files to place on the PYTHONPATH for Python apps.
    
    --files FILES          Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName).
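
    If the cluster cannot reach a Maven repository, another common workaround is to ship the graphframes Python bindings yourself so that every executor has them on its PYTHONPATH, independent of the jar. The snippet below is only a sketch: it assumes you have zipped the graphframes/ Python package (extracted from the graphframes jar) into a local graphframes.zip, and the path is illustrative.

        # Sketch: distribute the graphframes Python package to all executors.
        # Assumes graphframes.zip contains the graphframes/ package extracted
        # from graphframes-0.6.0-spark2.2-s_2.11.jar; the path is an example.
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("graphframes-pyfiles-sketch").getOrCreate()

        # addPyFile copies the archive to the driver and every executor and
        # puts it on the PYTHONPATH, so the import no longer depends on the jar.
        spark.sparkContext.addPyFile("/path/to/graphframes.zip")

        from graphframes import GraphFrame  # should now resolve everywhere

    Passing the same zip via --py-files on spark-submit achieves the same effect at launch time.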