
No module error when running spark-submit


I'm submitting a Python file which depends on custom modules to run. The file I'm trying to submit is located at project/main.py and our modules are located at project/modules/module1.py. I'm submitting to YARN in client mode and receiving the following error.

ModuleNotFoundError: No module named 'modules.module1'
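
That is, the project layout is:

project/
├── main.py
└── modules/
    └── module1.py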

The import statement in main.py:

from modules import module1

I have tried zipping the modules folder and passing it to --py-files:

spark-submit --master yarn --queue OurQueue --py-files hdfs://HOST/path/to/modules.zip \
--conf "spark.pyspark.driver.python=/hadoop/anaconda3.6/bin/python3" \
--conf "spark.pyspark.python=/hadoop/anaconda3.6/bin/python3" \
main.py

Solution

  • Assuming you have made the zip file as

    zip -r modules.zip modules
    

    I think you are missing the step of attaching this file to the Spark context; you can use the addPyFile() function in the script:

      sc.addPyFile("modules.zip")
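
    For context, a minimal driver sketch (assuming a SparkSession-based script; the app name is illustrative) attaches the zip before the first import of the packaged code:

      from pyspark.sql import SparkSession

      spark = SparkSession.builder.appName("main").getOrCreate()
      sc = spark.sparkContext

      # Ship the zip to the executors and add it to their import path
      sc.addPyFile("modules.zip")

      # Import only after the zip has been attached
      from modules import module1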
    

    Also, don't forget to put an empty __init__.py file at the root level of your directory inside modules.zip (i.e. modules/__init__.py).
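
    Unpacked, the zip should then contain:

      modules/
      ├── __init__.py
      └── module1.py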

    Now, to import it, I think you can write either

     from modules.module1 import *
    

    or

     from modules.module1 import module1
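
    For example, if module1.py defines a function named run_etl() (a hypothetical name, purely for illustration), main.py can then call it:

      from modules import module1

      # run_etl() is a stand-in for whatever module1.py actually defines
      module1.run_etl(spark)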
    

    Updated: now run the spark-submit command as

    spark-submit --master yarn --queue OurQueue --py-files modules.zip \
    --conf "spark.pyspark.driver.python=/hadoop/anaconda3.6/bin/python3" \
    --conf "spark.pyspark.python=/hadoop/anaconda3.6/bin/python3" \
    main.py
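
    To check that the executors can actually import the packaged module, an illustrative probe (building on the SparkSession sketch above) is:

      def probe(_):
          import modules.module1
          return modules.module1.__file__

      # Each task imports the module and reports where it was loaded from
      print(spark.sparkContext.parallelize([0]).map(probe).collect())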