python, r, amazon-web-services, aws-glue

Running R in an AWS Glue job


Suppose you have a set of R scripts that form an ETL pipeline, and you want to run them as an AWS Glue job. AWS Glue natively supports Python and Scala.

Is it possible to call an R script as a Python subprocess (or to call a bash script that wraps a set of R scripts) within an AWS Glue job running in a container with Python and R dependencies?

If so, please outline the steps required and key considerations.


Solution

  • Glue doesn't natively support running R scripts, so consider running them in a container on AWS Batch instead:

    1. Customise your own Docker image
    2. Push the image to ECR
    3. Configure the compute resources and schedule using AWS Batch

    Example folder structure

    .
    ├── Dockerfile
    └── scripts
        └── rtest.R
    

    Example Dockerfile based on https://hub.docker.com/r/rocker/tidyverse

    FROM rocker/tidyverse:4.2.2
    WORKDIR /scripts
    # Copy the whole directory; a destination that receives several files must end with a slash
    COPY scripts/ /scripts/
    RUN chmod 755 ./*
    # Install additional R libraries
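
If you still want Python to orchestrate the R scripts inside the container, as the question suggests, a wrapper along these lines is one option. This is a minimal sketch, not a tested setup: it assumes Python 3 has been installed in the image (the rocker base image may not ship it), and `build_r_command` and `run_pipeline` are hypothetical helper names. It also assumes every `*.R` file in `/scripts` is one pipeline step.

```python
import subprocess
from pathlib import Path


def build_r_command(script, args=(), rscript="Rscript"):
    """Return the argv list used to run one R script with optional arguments."""
    return [rscript, str(script), *map(str, args)]


def run_pipeline(scripts, rscript="Rscript"):
    """Run the given scripts in order, stopping at the first failure."""
    for script in scripts:
        # check=True raises CalledProcessError on a non-zero exit status,
        # so a failing R step fails the whole container run.
        subprocess.run(build_r_command(script, rscript=rscript), check=True)


if __name__ == "__main__":
    # Assumed layout: each *.R file in /scripts is one step,
    # executed in alphabetical order.
    run_pipeline(sorted(Path("/scripts").glob("*.R")))
```

Because the wrapper propagates a failing exit status, the job should be reported as failed when any R step errors out, rather than silently continuing.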
    

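    Example commands to push the image to ECR (replace region and aws_account_id with your own values; the ECR repository here is assumed to be named dev)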

    aws ecr get-login-password --region region | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.region.amazonaws.com
    
    docker build -t rdev .
    
    docker tag rdev:latest aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest
    
    docker push aws_account_id.dkr.ecr.region.amazonaws.com/dev:latest
    

    Ref: https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html

    Then follow this guide to set up an AWS Batch compute environment on Fargate, create a job definition that uses the image, and submit a job: https://docs.aws.amazon.com/batch/latest/userguide/getting-started-fargate.html
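
Once the job queue and job definition exist, submitting a run can also be scripted. The following is a rough sketch using boto3's Batch client and its `submit_job` call. The helper names `build_submit_job_request` and `submit_r_job` are hypothetical, and it assumes the job definition points at the image pushed above, so that `Rscript rtest.R` resolves against the `/scripts` working directory.

```python
def build_submit_job_request(job_name, job_queue, job_definition, script="rtest.R"):
    """Assemble keyword arguments for the AWS Batch SubmitJob call."""
    return {
        "jobName": job_name,
        "jobQueue": job_queue,
        "jobDefinition": job_definition,
        # Override the container command so the job runs one R script.
        "containerOverrides": {"command": ["Rscript", script]},
    }


def submit_r_job(request, region_name=None):
    """Submit the job and return the job ID reported by AWS Batch."""
    import boto3  # imported lazily so the request builder works without boto3

    batch = boto3.client("batch", region_name=region_name)
    return batch.submit_job(**request)["jobId"]
```

For example, `submit_r_job(build_submit_job_request("r-etl-run", "r-etl-queue", "r-etl-jobdef"))` would submit one run, where the queue and job definition names are placeholders for whatever you created in the guide.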