I am deploying a Spark job on AWS EMR and packaging all of my dependencies using Docker. My Python-generated spark-submit command looks like this:
...
cmd = (
f"spark-submit --deploy-mode cluster "
f"spark-submit --deploy-mode {deploy_mode} "
f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker "
f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={docker_image} "
f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={config} "
f"--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro "
f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker "
f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE={docker_image} "
f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG={config} "
f"--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro "
f"{path}"
)
...
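(For context, I am assuming the elided code runs the assembled string with something like subprocess; this invocation is my guess, not part of the original snippet:)

import shlex
import subprocess

# Assumed launch of the command string built above: split it into argv
# and run spark-submit, raising if it exits non-zero.
subprocess.run(shlex.split(cmd), check=True)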
It worked as expected when deploy_mode is cluster, but I don't see any of my Docker dependencies when deploy_mode is client. Can anyone explain why this is happening, and whether it is normal?
The Docker containers are managed by YARN on EMR: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-docker.html
In client mode, your Spark driver does not run in a Docker container because that process is not managed by YARN; it is executed directly on the node where you run the spark-submit command.
In cluster mode, the driver is managed by YARN and is therefore executed inside a Docker container.
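A minimal sketch of how one might make this explicit when building the command (the helper name and the warning are illustrative, not part of the EMR docs; the conf keys are the ones from the question):

import warnings

def docker_confs(deploy_mode, docker_image, config):
    """Assemble the YARN Docker runtime confs from the question (illustrative only)."""
    mounts = "/etc/passwd:/etc/passwd:ro"
    confs = {
        # Executors are always YARN containers, so these always take effect.
        "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": docker_image,
        "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG": config,
        "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS": mounts,
        # These only dockerize the driver when YARN launches it, i.e. in cluster mode.
        "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE": "docker",
        "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": docker_image,
        "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG": config,
        "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS": mounts,
    }
    if deploy_mode == "client":
        # In client mode the driver runs on the submitting host, outside any
        # Docker container, so its dependencies must be installed there.
        warnings.warn("client mode: the driver will NOT run inside the Docker image")
    return " ".join(f"--conf {k}={v}" for k, v in confs.items())

So in client mode the executors still get your image, but anything the driver itself imports has to be available on the node you submit from.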