Tags: python, apache-spark, pyspark, project-structure, modular

PySpark: Is it required to pass additional modules as a --py-files argument in a project?


I am creating a PySpark application which is modular in nature. My code structure is like this:

├── main.py
├── src
│   ├── __init__.py
│   ├── jobs
│   │   ├── __init__.py
│   │   └── logic.py
│   └── utils
│       ├── __init__.py
│       └── utility.py

My start script is main.py, which in turn calls the logic function in the logic.py file.
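
For illustration, main.py looks roughly like the sketch below (simplified; run_job and load_config are placeholder names, not my exact code):

    # main.py -- simplified sketch of the entry point
    from pyspark.sql import SparkSession

    from src.jobs.logic import run_job          # placeholder name for the logic function
    from src.utils.utility import load_config   # placeholder helper from utils


    def main():
        spark = SparkSession.builder.appName("modular-app").getOrCreate()
        config = load_config()
        run_job(spark, config)
        spark.stop()


    if __name__ == "__main__":
        main()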

I am running my project with spark-submit main.py.

My question is: do I need to mention the other .py files in the spark-submit command, or do they get imported automatically?

I came across a post which mentions zipping the src folder and passing it as an argument to --py-files.

Which is the right way?

Should I keep the current structure and run the code from main.py like I do now?

Is there any difference between these two approaches (logically and performance-wise)?


Solution

  • When running locally there is no need to pass additional modules as a zip with the --py-files flag: your code is local, and so are the master and the workers, so they all have access to your code and the modules it needs.

    However, when you want to submit a job to a cluster, the master and the workers need access to your main.py file along with all the modules it uses. By using the --py-files argument you specify the location of those extra modules, so both the master and the workers can reach every part of the code that needs to run. If you just run spark-submit main.py on a cluster, it won't work, because 1) the location of main.py is relative to your system, so the cluster won't be able to locate it, and 2) the imports in main.py will fail with ImportErrors, since the workers don't have your src modules.
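
    One way to build such a zip is sketched below (any zip tool works just as well); the important detail is that the src directory itself is the top-level entry of the archive, so that imports like from src.jobs import logic still resolve once the zip is on the workers' Python path:

    # make_zip.py -- minimal sketch for producing src.zip next to main.py
    import shutil

    # Creates src.zip whose entries start with "src/", so the src package
    # keeps its name when the archive is added to the Python path.
    shutil.make_archive(base_name="src", format="zip", root_dir=".", base_dir="src")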

    Note: the --py-files flag must come before main.py on the command line, and the zipped files (as well as main.py itself) need to be somewhere accessible to the whole cluster, not local on your machine, e.g. on an FTP or HTTP server. For example, to submit to a cluster through Mesos:

    spark-submit --master mesos://path/to/service/spark --deploy-mode cluster --py-files http://somedomainforfileserving/src.zip  http://somedomainforfileserving/main.py
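
    If you prefer to handle this dependency programmatically instead of on the command line, SparkContext.addPyFile can attach the same zip from inside main.py; a minimal sketch, with the URL as a placeholder:

    # Sketch: attaching src.zip from code instead of the --py-files flag.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("modular-app").getOrCreate()

    # The URL is a placeholder; it must be reachable from every node (HTTP, HDFS, ...).
    spark.sparkContext.addPyFile("http://somedomainforfileserving/src.zip")

    # Any `from src...` import should happen only after addPyFile,
    # so the archive is already on the Python path.
    from src.jobs import logic  # deliberately imported after addPyFile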
    

    Edit: As for jar dependencies, e.g. the Elasticsearch connector, you can put the jars inside src, e.g. in src/jars, so that they get zipped and distributed along with everything else, and then, when submitting to your cluster, reference the jar by its path relative to src. E.g.:

    spark-submit --master mesos://path/to/service/spark --deploy-mode cluster --jars src/jars/elasticsearch-spark-someversion.jar --py-files http://somedomainforfileserving/src.zip  http://somedomainforfileserving/main.py
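
    Once the jar is available, the connector is used from PySpark like any other DataFrame source; a minimal sketch, assuming the elasticsearch-hadoop data source and placeholder host and index names:

    # Sketch: reading from Elasticsearch once the connector jar is distributed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("es-example").getOrCreate()

    df = (spark.read
          .format("org.elasticsearch.spark.sql")     # data source provided by the connector jar
          .option("es.nodes", "elasticsearch-host")  # placeholder host
          .option("es.port", "9200")
          .load("my_index"))                         # placeholder index name
    df.show()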