I am running a PySpark script that depends on a complicated project with a number of custom submodules. I would like to run this job with several different versions of the code against a Spark standalone cluster.
To make this work, I need to put my project on the PYTHONPATH of every worker. This works fine if I add my project's source to PYTHONPATH and THEN start the standalone cluster. However, if I edit PYTHONPATH afterwards, the workers keep using whatever was on the path when the cluster started, not what is on it when I run spark-submit.
This matters because I want to run jobs against multiple versions of the code, which means I need to be able to load different versions dynamically. Zipping my source and calling sc.addPyFile() in my script does not work either.
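For context, this is roughly what that failed attempt looks like (the zip path and module names here are placeholders, not my real layout):

from pyspark import SparkContext

sc = SparkContext(appName='my-job')   # placeholder app name
# ship a zipped copy of the project source to the executors
sc.addPyFile('/tmp/myproject.zip')    # placeholder path to the zipped source
import mysubmodule                    # placeholder for one of the project's submodules
# in my setup the executors still do not pick up the new version this way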
Is there a way to dynamically change the python code on my path in-between spark-submit jobs without restarting my standalone cluster?
The easiest way is to modify sys.path before importing the module. For example:
import sys

# insert the directory that CONTAINS modulename, so this version wins
# over anything later on the path
sys.path.insert(0, '/path/to/module/you/want/to/use/this/time')
import modulename
But remember that this path must exist on all your worker nodes; Spark will not copy the libraries for you.
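One way to make the executors pick the path up as well is to do the insert inside the function that runs on the workers. This is only a minimal sketch, assuming the code is checked out at the same location on every node; the paths, module name and transform function are placeholders:

import sys
from pyspark import SparkContext

sc = SparkContext(appName='path-demo')   # placeholder app name

def process(records):
    # this runs in the executor's Python process, so the insert changes
    # the worker's own sys.path before the import happens there
    sys.path.insert(0, '/path/to/version/you/want/to/use/this/time')  # placeholder
    import modulename                                  # placeholder module
    return [modulename.transform(r) for r in records]  # placeholder function

print(sc.parallelize(range(10), 2).mapPartitions(process).collect())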
If you need to change sys.path after importing the module, you will need to use reload/imp/importlib (depending on which version of Python you use).
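On Python 3 that can look roughly like this (module and path names are placeholders); note that reload() only re-executes the named module itself, so already-imported submodules keep their old code unless you reload them too:

import sys
import importlib

import modulename                             # version found on the original path

# point sys.path at a different checkout, then re-import;
# for a top-level module, reload() re-resolves it against the current sys.path
sys.path.insert(0, '/path/to/other/version')  # placeholder path
modulename = importlib.reload(modulename)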