Tags: python, apache-spark, pyspark, zip

Accessing user-defined modules in the PySpark shell (ModuleNotFoundError: No module named)


Normally we do a spark-submit with the ZIP file:

    spark-submit --name App_Name --master yarn --deploy-mode cluster --archives /<path>/myzip.zip#pyzip /<path>/Processfile.py

and in the .py files we access the modules with

    from dir1.dir2.dir3.module_name import module_name

and the module import works fine.
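
For that import to resolve, the archive root has to mirror the package path, i.e. the ZIP must contain dir1/dir2/dir3/ (each level with an __init__.py) at its top level. A minimal sketch of building such an archive with the standard library, assuming it is run from the directory that contains dir1 (the helper name build_dependency_zip is illustrative, not from the original post):

    import os
    import zipfile

    def build_dependency_zip(package_root, zip_path):
        """Zip the package tree rooted at package_root (e.g. "dir1"), storing
        paths relative to its parent so the archive root contains dir1/..."""
        parent = os.path.dirname(os.path.abspath(package_root))
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for folder, _, files in os.walk(package_root):
                for name in files:
                    full = os.path.join(folder, name)
                    zf.write(full, os.path.relpath(os.path.abspath(full), parent))

    # Run from the directory that holds dir1/
    build_dependency_zip("dir1", "myzip.zip")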

When I try to do the same in the pyspark shell, it gives me a module not found error:

    pyspark --py-files /<path>/myzip.zip#pyzip
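
One quick way to see whether the archive is even visible to the driver is to inspect sys.path from inside the shell (this check is a suggestion, not something from the original post):

    # Inside the pyspark shell: list any ZIP entries Spark has put on the
    # driver's module search path; the dependency ZIP should appear here
    # before the dir1.dir2.dir3 imports can work.
    import sys
    print([p for p in sys.path if p.endswith(".zip")])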

How can the modules be accessed in the PySpark shell?


Solution

  • I was finally able to import the modules in the PySpark shell. The ZIP I am passing has all the dependency modules installed into a Python virtual environment and then packaged as a ZIP.

    So in such cases, activating the virtual environment and then starting the PySpark shell did the trick:

    # activate the virtual environment the ZIP was built from
    source bin/activate
    # start the PySpark shell with the packaged dependencies attached
    pyspark --archives <path>/filename.zip
    

    This also didn't require adding the py-files to the SparkContext.
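
    For completeness, the runtime alternative that turned out to be unnecessary here would look roughly like this from inside an already-running PySpark shell (the path is a placeholder, as above):

    # Register the ZIP with the running SparkContext instead of (or in addition
    # to) passing it on the command line; addPyFile accepts .zip archives.
    spark.sparkContext.addPyFile("/<path>/myzip.zip")
    from dir1.dir2.dir3.module_name import module_name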