Python: 3.7.3
OS: CentOS 7
Spark: 2.2.0 (Cloudera distribution)
YARN: 2.6.0-cdh5.10.2
Hi, I am trying to run Python scripts on Apache Spark with pyspark, but I don't understand how the workflow works. I try to ship a whole conda environment to YARN in client mode with the --archives argument when executing spark-submit. The question is: where does the main Python script actually run? I need to point it at my shipped conda environment so it executes without errors, because the host from which I run spark-submit does not have the dependencies installed, and I don't want to install them there.
I used conda-pack to package the environment (https://conda.github.io/conda-pack/spark.html), and I need to import the dependencies outside of a map function (inside a map it works fine, because YARN ships the dependencies and the executors import them correctly).
Is there a way to execute the main Python script with the shipped environment, without having to unpack and use it on the host?
My environment variables are:
PYSPARK_DRIVER_PYTHON=./environment/bin/python
PYSPARK_PYTHON=./environment/bin/python
where environment is the symbolic link to the dependencies archive shipped by YARN:
--archives ~/dependencies.tar.gz#environment
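For reference, dependencies.tar.gz is produced with conda-pack as described in the linked page; a minimal sketch using its Python API (the environment name my_env is a placeholder):

# Package a named conda environment into a relocatable tarball with conda-pack.
# "my_env" is a placeholder for the environment that contains numpy and the
# other dependencies.
import conda_pack

conda_pack.pack(
    name="my_env",                 # conda environment to package
    output="dependencies.tar.gz",  # archive later passed to --archives
)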
And I configure the YARN application master with
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python
So the final command is
PYSPARK_DRIVER_PYTHON=./environment/bin/python \
PYSPARK_PYTHON=./environment/bin/python \
spark-submit \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./environment/bin/python \
--master yarn --deploy-mode client \
--archives environment/dependencies.tar.gz#environment \
cluster-import-check.py
And my code is
# coding=utf8
from pyspark import SparkConf
from pyspark import SparkContext
import sys
import numpy


def check_import(version, info=None):
    # Print the version of a module plus any extra details passed in.
    print("=====VERSION : ", version)
    if info and type(info) == list and len(info) != 0:
        for i in info:
            print("=====INFO EXTRA : ", i)


def python_import(x):
    # Runs on an executor: report which Python interpreter is being used.
    import sys
    print("\n===PYTHON")
    check_import(sys.version, [sys.executable])


def numpy_import(x):
    # Runs on an executor: report which numpy is being used.
    import numpy
    print("\n===NUMPY")
    check_import(numpy.__version__, [numpy.__file__])


def printInfo(object):
    # Runs on the driver: report name, version and location of a module.
    print("=====NAME : ", object.__name__)
    if object.__name__ == 'sys':
        print("=====VERSION", object.version)
        print("=====LOCATED IN", object.executable)
    else:
        print("=====VERSION : ", object.__version__)
        print("=====LOCATED IN : ", object.__file__)
    if object.__name__ == 'elasticsearch':
        # In this branch 'object' is the elasticsearch module itself.
        es = object.Elasticsearch(['172.22.248.206:9201'])
        print("=====MORE INFO : ", es.info())


def init_spark():
    conf = SparkConf()
    conf.setAppName("imports-checking")
    sc = SparkContext(conf=conf).getOrCreate()
    return conf, sc


def main():
    conf, sc = init_spark()
    print(sc.getConf().getAll())
    print(sc.parallelize([0]).map(lambda x: python_import(x)).collect())
    sc.stop()


if __name__ == '__main__':
    printInfo(sys)
    printInfo(numpy)
    main()
One of the errors I get is "No module named numpy", or the Python that gets picked up is a different one, because the cluster nodes have another version of Python installed; but I want to use the whole environment shipped by YARN on the cluster.
I finally understood the workflow of PySpark with YARN. The answer is: if you want to run in client mode, you need to have installed, on the host from which you execute spark-submit, all the libraries that are imported outside of the map functions. On the other hand, if you want to run in cluster mode, you only need to ship the libraries with the --archives option of the spark-submit command.
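In other words, what matters is where each import statement is resolved (a minimal illustrative sketch, not the original script): module-level imports run in the driver's Python, while imports inside a mapped function run in the executors' Python, which comes from the shipped archive.

# Sketch: where each import is resolved when submitted with
# --archives dependencies.tar.gz#environment (names as in the commands above).
from pyspark import SparkConf, SparkContext

# Driver-side import: in client mode this runs on the submitting host, so
# numpy must exist in the interpreter pointed at by PYSPARK_DRIVER_PYTHON.
# In cluster mode the driver runs in a YARN container and can use the
# shipped environment instead.
import numpy

sc = SparkContext(conf=SparkConf().setAppName("import-location-demo"))

def where_is_numpy(_):
    # Executor-side import: resolved by PYSPARK_PYTHON on the executor,
    # i.e. the Python unpacked from the shipped archive.
    import numpy
    return numpy.__file__

print("driver numpy  :", numpy.__file__)
print("executor numpy:", sc.parallelize([0]).map(where_is_numpy).collect())
sc.stop()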
Local mode: everything runs on the host that executes spark-submit, so you only have to set PYSPARK_DRIVER_PYTHON when launching spark-submit:
PYSPARK_DRIVER_PYTHON=./dependencies/bin/python spark-submit --master local --py-files cognition.zip MainScript.py
Client mode: the driver is executed on the host that runs the spark-submit command, and the environment variable must be added as well:
PYSPARK_DRIVER_PYTHON=./dependencies/bin/python spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./dependencies/bin/python --master yarn --deploy-mode client --archives dependencies.tar.gz#dependencies MainScript.py
Cluster mode: the driver is executed inside the YARN container that gets created, on any node of the cluster:
spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./dependencies/bin/python --master yarn --deploy-mode cluster --archives dependencies.tar.gz#dependencies MainScript.py
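As a quick sanity check (a hypothetical addition, not part of MainScript.py), printing sys.executable at the top of the script shows where the driver's Python actually comes from: in cluster mode it should be a path under the unpacked ./dependencies directory inside the YARN container, while in client mode it is whatever PYSPARK_DRIVER_PYTHON points at on the submitting host.

# Hypothetical first lines of MainScript.py: show which Python the driver uses.
# Cluster mode: expect a path under the container's ./dependencies directory.
# Client mode:  expect the interpreter given by PYSPARK_DRIVER_PYTHON.
import sys

print("driver python executable:", sys.executable)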