python docker apache-spark pyspark amazon-emr

Is it possible to run Spark UDF functions (mainly Python) under Docker?

I'm using PySpark on EMR. To simplify the setup of Python libraries and dependencies, we're using Docker images.

This works fine for general Python applications (non-Spark) and for the Spark driver (calling spark-submit from within a Docker image).

However, I couldn't find a way to make the workers run within a Docker image (either the "full" worker or just the UDF functions).

EDIT: I found a solution with the beta EMR version. If there is an alternative for current (5.*) EMR versions, it is still relevant.

Solution

Apparently YARN 3.2 supports this feature: https://hadoop.apache.org/docs/r3.2.0/hadoop-yarn/hadoop-yarn-site/DockerContainers.html

It is expected to be available with EMR 6 (now in beta): https://aws.amazon.com/blogs/big-data/run-spark-applications-with-docker-using-amazon-emr-6-0-0-beta/
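For illustration, here is a minimal sketch of how such a job might be submitted, assuming the cluster already has the YARN Docker runtime enabled and the image is reachable from the nodes. The environment variable names `YARN_CONTAINER_RUNTIME_TYPE` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE` come from the YARN documentation linked above; the image name, script name, and helper functions are hypothetical placeholders. The code only builds the `spark-submit` command line, it does not run it.

```python
from typing import Dict, List


def docker_runtime_conf(image: str) -> Dict[str, str]:
    """Spark settings asking YARN to launch containers from a Docker image.

    The variables are set for both the executors (where Python UDFs run)
    and the application master (the driver, in cluster deploy mode).
    """
    conf: Dict[str, str] = {}
    for prefix in ("spark.executorEnv", "spark.yarn.appMasterEnv"):
        conf[f"{prefix}.YARN_CONTAINER_RUNTIME_TYPE"] = "docker"
        conf[f"{prefix}.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE"] = image
    return conf


def build_spark_submit(image: str, script: str) -> List[str]:
    """Assemble a spark-submit argument list that uses the Docker runtime."""
    cmd = ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster"]
    for key, value in docker_runtime_conf(image).items():
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Hypothetical image and script names, for illustration only.
    print(" ".join(build_spark_submit("my-registry/pyspark-deps:latest", "job.py")))
```

With this approach the Python libraries live in the image rather than on the cluster nodes, so the UDF code on the executors sees the same dependencies as the driver. A private registry would likely need additional credentials configuration, which is not covered here.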
