I'm using pyspark
on emr
. To simplify the setup of python
libraries and dependencies, we're using docker
images.
This works fine for general python applications (non spark), and for the spark driver (calling spark submit from within a docker image)
However, I couldn't find a method to make the workers run within a docker image (either the "full" worker, or just the UDF
functions)
EDIT Found a solution with beta EMR version, if there's some alternative with current (5.*) EMR versions it's still relevant
Apparently yarn 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html
and it expected to be available with EMR 6 (now in beta) https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/