Tags: python, apache-spark, pyspark, anaconda, hadoop-yarn

Where is the Python script executed when using spark-submit?


Python: 3.7.3
OS: CentOS 7
Spark: 2.2.0 (on Cloudera)
YARN: 2.6.0-cdh5.10.2

Hi, I am trying to run Apache Spark with Python scripts through PySpark, but I don't understand the workflow. I ship a whole conda environment to YARN in client mode with the --archives argument of spark-submit. The question is: where does the main Python script actually run? I need to specify the location of my shared conda environment so that the script executes without errors, because the host from which I run spark-submit does not have the dependencies installed, and I don't want to install them there.

I use this feature to pack the environment: https://conda.github.io/conda-pack/spark.html. I also need to import the dependencies outside of a map (imports inside a map already work, because YARN ships the archive and the executors import those dependencies without problems).
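To be clear about what I mean by "outside" versus "inside" a map, here is a minimal sketch (not my real code): the top-level import has to be resolved wherever the main script itself runs, while the import inside the mapped function is resolved on the executors, which do receive the shipped archive.

import numpy  # "outside a map": resolved where the main script (the driver) runs

def uses_numpy_inside_map(x):
    import numpy  # "inside a map": resolved on the executors, which use the
    return numpy.__version__  # environment shipped with --archives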

Is there a way to execute the main Python script with the shipped environment, without unpacking and using it on the submitting host?

My environment variables are:

PYSPARK_DRIVER_PYTHON=./environment/bin/python
PYSPARK_PYTHON=./environment/bin/python

where environment is the symbolic link to the dependencies archive shipped with YARN:

--archives ~/dependencies.tar.gz#environment
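As a side note, this is the kind of check I run inside a map (so on an executor) to see where that alias ends up; I am assuming here that YARN unpacks the archive into each container's working directory under the name given after the #:

import os

def inspect_shipped_env(_):
    # './environment' is the alias created by --archives ...#environment;
    # on each executor it should appear in the container's working directory
    alias = "./environment"
    return (os.getcwd(),
            os.path.realpath(alias),
            os.path.exists(os.path.join(alias, "bin", "python")))

Running sc.parallelize([0]).map(inspect_shipped_env).collect() from the driver then shows where the executors see the unpacked environment.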

And I configure the executors with:

--conf spark.yarn.appMasterEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python

So the final command is

PYSPARK_DRIVER_PYTHON=./environment/bin/python \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn --deploy-mode client \
--archives environment/dependencies.tar.gz#environment \
cluster-import-check.py

And my code is

# coding=utf8
from pyspark import SparkConf
from pyspark import SparkContext

import sys
import numpy

def check_import(version, info=None):
    print("=====VERSION : ", version)
    if isinstance(info, list) and info:
        for i in info:
            print("=====INFO EXTRA : ", i)

def python_import(x):
    import sys
    print("\n===PYTHON")
    check_import(sys.version, [sys.executable])

def numpy_import(x):
    import numpy
    print("\n===NUMPY")
    check_import(numpy.__version__, [numpy.__file__])

def printInfo(module):
    print("=====NAME : ", module.__name__)
    if module.__name__ == 'sys':
        print("=====VERSION", module.version)
        print("=====LOCATED IN", module.executable)
    else:
        print("=====VERSION : ", module.__version__)
        print("=====LOCATED IN : ", module.__file__)

    if module.__name__ == 'elasticsearch':
        # 'module' is the elasticsearch module itself in this branch
        es = module.Elasticsearch(['172.22.248.206:9201'])
        print("=====MORE INFO : ", es.info())

def init_spark():
    conf = SparkConf()
    conf.setAppName("imports-checking")

    sc = SparkContext.getOrCreate(conf=conf)

    return conf, sc


def main():
    conf, sc = init_spark()
    print(sc.getConf().getAll())

    print(sc.parallelize([0]).map(lambda x: python_import(x)).collect())

    sc.stop()

if __name__ == '__main__':
    printInfo(sys)
    printInfo(numpy)
    main()

And one error I get is "No module named numpy", or the reported Python location is a different interpreter, because another version of Python is installed on the cluster nodes; but I want to use, across the whole cluster, the environment shipped with YARN.


Solution

  • I finally understood the workflow of PySpark with YARN. The answer is: if you want to run in client mode, you need to have installed, on the host where you execute spark-submit, all the libraries that are imported outside of the map functions. On the other hand, if you want to run in cluster mode, you only need to ship the libraries with the --archives option of the spark-submit command (see the sketch after this list).

    1. local

    Everything runs on the local host, so you only have to set PYSPARK_DRIVER_PYTHON when invoking spark-submit:

    PYSPARK_DRIVER_PYTHON=./dependencies/bin/python spark-submit --master local --py-files cognition.zip MainScript.py
    
    2. yarn-client

    The driver is executed on the host that runs the spark-submit command, so the PYSPARK_DRIVER_PYTHON environment variable must be set there:

    PYSPARK_DRIVER_PYTHON=./dependencies/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./dependencies/bin/python --master yarn --deploy-mode client --archives dependencies.tar.gz#dependencies MainScript.py
    
    3. yarn-cluster

    The driver is executed inside the YARN container that is created for the application, on whichever node of the cluster it lands, so only the shipped archive is needed:

    spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./dependencies/bin/python --master yarn --deploy-mode cluster --archives dependencies.tar.gz#dependencies MainScript.py
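
    As a sketch of why the mode matters (assuming the archive is shipped as dependencies.tar.gz#dependencies, as in the commands above, and that PYSPARK_PYTHON points at the unpacked env): module-level imports are resolved wherever the driver process lives, while imports inside a mapped function are resolved on the executors, which always receive the shipped archive.

    # where_imports_run.py -- a minimal sketch, not the real job
    import sys
    import numpy                      # module-level import: resolved by the DRIVER
                                      #   client mode  -> the submitting host's Python
                                      #   cluster mode -> the Python inside the YARN
                                      #                   container (the shipped env)
    from pyspark import SparkContext

    def executor_side(_):
        import numpy as np            # import inside a map: resolved by the EXECUTORS,
        return (sys.executable,       # which run PYSPARK_PYTHON=./dependencies/bin/python
                np.__version__,
                np.__file__)

    if __name__ == '__main__':
        sc = SparkContext.getOrCreate()
        print("driver interpreter :", sys.executable)
        print("driver numpy       :", numpy.__file__)
        print("executor report    :",
              sc.parallelize([0], 1).map(executor_side).collect())
        sc.stop()

    In yarn-client mode the two driver prints come from the submitting host's interpreter (hence the "No module named numpy" error when numpy is not installed there), while in yarn-cluster mode they come from the shipped environment, because the driver itself runs inside a YARN container.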