In Apache Beam's Spark documentation, it says that you can specify --environment_type="DOCKER"
to customize the runtime environment:
The Beam SDK runtime environment can be containerized with Docker to isolate it from other runtime systems. To learn more about the container environment, read the Beam SDK Harness container contract.
...
You may want to customize container images for many reasons, including:
- Pre-installing additional dependencies
- Launching third-party software in the worker environment
- Further customizing the execution environment
...
# When running batch jobs locally, we need to reuse the container
# (hence --environment_cache_millis below).
python -m apache_beam.examples.wordcount \
--input=/path/to/inputfile \
--output=path/to/write/counts \
--runner=SparkRunner \
--environment_cache_millis=10000 \
--environment_type="DOCKER" \
--environment_config="${IMAGE_URL}"
If you submit this job to an existing Spark cluster, what does the Docker image do to the Spark cluster? Does it run all the Spark executors with that Docker image? What happens to the existing Spark executors, if there are any? What about the Spark driver? What is the mechanism used (the Spark driver API?) to distribute the Docker image to the machines?
TL;DR: The answer to this question is in this picture (taken from my talk at Beam Summit 2020 about running cross-language pipelines on Beam).
For example, if you run your Beam pipeline with the Beam Portable Runner on a Spark cluster, the Beam Portable Spark Runner translates your job into a normal Spark job and then submits/runs it on your ordinary Spark cluster. So it uses the driver and executors of your Spark cluster, as usual.
As you can see from this picture, the Docker container is used just as part of the SDK Harness to execute DoFn code independently of the "main" language of your pipeline (for example, to run some Python code as part of a Java pipeline).
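Concretely, running a portable pipeline against an existing Spark cluster usually goes through the Beam Spark job server, which does that translation and submits the result as a regular Spark job. A minimal sketch (the image tag, master URL and ports are illustrative, not taken from the question):

# Start a Beam Spark job server pointed at the existing cluster;
# by default it accepts portable job submissions on port 8099.
docker run --net=host apache/beam_spark_job_server:latest \
--spark-master-url=spark://spark-master:7077

# Submit the same wordcount through the job server; it becomes an
# ordinary Spark job on your cluster, and the DOCKER environment only
# affects the SDK Harness containers started next to the executors.
python -m apache_beam.examples.wordcount \
--input=/path/to/inputfile \
--output=path/to/write/counts \
--runner=PortableRunner \
--job_endpoint=localhost:8099 \
--environment_type="DOCKER" \
--environment_config="${IMAGE_URL}"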
The only requirement, iirc, is that Docker has to be installed on your Spark executor nodes so that they can run the SDK Harness container(s). You can also pre-fetch the Beam SDK Docker images on the executor nodes to avoid pulling them the first time a job runs.
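For example, pre-fetching can be as simple as pulling the SDK image on every executor node before submitting the job (the image name and tag below are illustrative; use whatever you pass in --environment_config):

# Run once on each Spark executor node so the first bundle
# does not have to wait for the image pull:
docker pull apache/beam_python3.9_sdk:2.48.0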
An alternative that Beam Portability provides for portable pipelines is to execute the SDK Harness as a normal system process. In this case, you need to specify environment_type="PROCESS" and provide the path to an executable file (which obviously has to be installed on all executor nodes).
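A sketch of that variant, assuming the SDK Harness boot executable has already been built and copied to the same path on every executor node (the path and endpoints are placeholders):

# PROCESS runs the SDK Harness as a local process on each executor node;
# the JSON "command" points at the boot executable instead of a Docker image.
python -m apache_beam.examples.wordcount \
--input=/path/to/inputfile \
--output=path/to/write/counts \
--runner=PortableRunner \
--job_endpoint=localhost:8099 \
--environment_type="PROCESS" \
--environment_config='{"command": "/path/to/boot"}'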