Imagine you have a set of R scripts forming an ETL pipeline that you want to run as an AWS Glue job. AWS Glue only supports Python and Scala.
Is it possible to call an R script as a Python subprocess (or a bash script that wraps a set of R scripts) within an AWS Glue job running in a container with both Python and R dependencies?
If so, please outline the steps required and key considerations.
As AWS Glue doesn't natively support running R scripts, you can consider the following alternative: package the R scripts into a Docker image and run it as a containerized job on AWS (for example via AWS Batch on ECS Fargate).
Example folder structure
.
├── Dockerfile
└── scripts
└── rtest.R
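For illustration only, since the real ETL scripts aren't shown, `rtest.R` could be any self-contained R script; a minimal hypothetical placeholder might look like:

```r
# rtest.R -- hypothetical placeholder for the real ETL logic
library(tidyverse)  # already available in the rocker/tidyverse base image

message("R job started")

# Stand-in transformation step using a built-in dataset
result <- mtcars %>% summarise(mean_mpg = mean(mpg))
print(result)

message("R job finished")
```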
Example Dockerfile based on https://hub.docker.com/r/rocker/tidyverse
FROM rocker/tidyverse:4.2.2
WORKDIR /scripts
COPY scripts/* /scripts
RUN chmod 755 ./*
# Install additional R libraries here, e.g.:
# RUN install2.r --error data.table
# Run the entry-point script when the container starts
CMD ["Rscript", "rtest.R"]
Example commands to build the image and push it to ECR (the target repository must exist first)
aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com
aws ecr create-repository --repository-name dev --region region
docker build -t rdev .
docker tag rdev:latest aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest
docker push aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest
Ref: https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
Then follow this guide to set up AWS Batch on ECS Fargate, create a job definition pointing at the ECR image, and submit a job: https://docs.aws.amazon.com/batch/latest/userguide/getting-started-fargate.html
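On the subprocess part of the question: inside a container that has both Python and R installed, you can drive the R scripts from Python. A minimal sketch (the `rscript` parameter is only an illustration hook so the wrapper can be exercised with another interpreter; in the container it would default to `Rscript`):

```python
import subprocess


def run_r_script(script_path, args=(), rscript="Rscript"):
    """Run an R script as a subprocess and return its stdout.

    check=True raises subprocess.CalledProcessError on a non-zero
    exit code, so a failing R step fails the surrounding job instead
    of being silently ignored.
    """
    result = subprocess.run(
        [rscript, script_path, *args],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Log `result.stderr` on failure (it is attached to the raised `CalledProcessError`) so R error messages surface in the job logs.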