
Is it possible to run Spark UDF functions (mainly Python) under Docker?


I'm using PySpark on EMR. To simplify the setup of Python libraries and dependencies, we're using Docker images.

This works fine for general Python applications (non-Spark) and for the Spark driver (calling spark-submit from within a Docker image).

However, I couldn't find a way to make the workers run within a Docker image (either the "full" worker or just the UDF functions).
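For context, the kind of UDF in question is an ordinary Python UDF whose body imports a third-party library. A minimal sketch (the numpy dependency is just an illustrative example):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

import numpy as np  # third-party dependency: must be importable on every worker

spark = SparkSession.builder.appName("udf-example").getOrCreate()

# The UDF body executes in Python worker processes on the executors,
# so numpy has to be installed there too, not only on the driver.
@udf(returnType=DoubleType())
def l2_norm(xs):
    return float(np.linalg.norm(xs))

df = spark.createDataFrame([([3.0, 4.0],)], ["xs"])
df.select(l2_norm("xs").alias("norm")).show()  # prints 5.0
```

Running the driver in Docker covers the imports at the driver, but the UDF body above still fails on the executors unless its dependencies are available there as well.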

EDIT: I found a solution with a beta EMR version (see below). If there's an alternative that works with the current (5.x) EMR versions, the question is still relevant.


Solution

  • Apparently YARN 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html

    It is expected to be available with EMR 6 (now in beta): https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/ (see the sketch after this list).
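With that YARN 3.2 / EMR 6 setup, the executors (and hence the Python UDF workers) can be pointed at a Docker image through Spark configuration, as described in the AWS blog post above. A minimal sketch, assuming YARN's Docker runtime is enabled on the cluster; the image name is hypothetical:

```python
from pyspark.sql import SparkSession

# Hypothetical image name; in practice this would live in ECR or a private registry.
DOCKER_IMAGE = "123456789012.dkr.ecr.us-east-1.amazonaws.com/pyspark-deps:latest"

spark = (
    SparkSession.builder
    .appName("udf-in-docker")
    # Launch executor containers (and their Python UDF workers) inside the image
    .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
    .config("spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", DOCKER_IMAGE)
    # In cluster mode, the same pair applies to the application master / driver
    .config("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE", "docker")
    .config("spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", DOCKER_IMAGE)
    .getOrCreate()
)
```

The same settings can equally be passed as --conf flags to spark-submit; the spark.yarn.appMasterEnv.* pair puts the driver inside the image as well, matching the driver-side setup described in the question.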