Tags: amazon-web-services, docker, amazon-s3, containers, aws-batch

How to specify input and output volume S3 paths in AWS Batch job definition json?


I'm trying to adapt my workflow to run as an AWS Batch job using either an EC2 or a Fargate backend (the example below is for an EC2 instance, AFAIK). I've adapted my Docker containers to mount input and output directories separately. When I run my Docker container locally, I use the following commands:

# Directories
LOCAL_WORKING_DIRECTORY=$(pwd)
LOCAL_OUTPUT_PARENT_DIRECTORY=../
LOCAL_OUTPUT_PARENT_DIRECTORY=$(realpath -m ${LOCAL_OUTPUT_PARENT_DIRECTORY})

CONTAINER_INPUT_DIRECTORY=/volumes/input/
CONTAINER_OUTPUT_DIRECTORY=/volumes/output/

# Parameters
ID=S1
R1=Fastq/${ID}_1.fastq.gz
R2=Fastq/${ID}_2.fastq.gz
NAME=VEBA-preprocess__${ID}
RELATIVE_OUTPUT_DIRECTORY=veba_output/preprocess/

# Command
CMD="preprocess.py -1 ${CONTAINER_INPUT_DIRECTORY}/${R1} -2 ${CONTAINER_INPUT_DIRECTORY}/${R2} -n ${ID} -o ${CONTAINER_OUTPUT_DIRECTORY}/${RELATIVE_OUTPUT_DIRECTORY}"

# Docker
DOCKER_IMAGE="jolespin/veba_preprocess:1.1.2"
docker run \
    --name ${NAME} \
    --rm \
    --volume ${LOCAL_WORKING_DIRECTORY}:${CONTAINER_INPUT_DIRECTORY} \
    --volume ${LOCAL_OUTPUT_PARENT_DIRECTORY}:${CONTAINER_OUTPUT_DIRECTORY} \
    ${DOCKER_IMAGE} \
    -c "${CMD}"

I'm having trouble figuring out how to create the AWS Batch job definition JSON.

My input directory's S3 URI is s3://path/to/input/, which contains 35_R1.fq.gz and 35_R2.fq.gz.

I'd like my output to be in the following S3 directory: s3://path/to/output/

How do I specify the input and output volume S3 paths in the job definition JSON? Here is what I have so far, with mountPoints and volumes left empty:

{
  "jobDefinitionName": "preprocess__35",
  "type": "container",
  "containerProperties": {
    "image": "jolespin/veba_preprocess:1.1.2",
    "vcpus": 4,
    "memory": 16000,
    "command": [
      "preprocess.py",
      "-1",
      "/volumes/input/35_R1.fq.gz",
      "-2",
      "/volumes/input/35_R2.fq.gz",
      "-n",
      "35",
      "-o",
      "/volumes/output/veba_output/preprocess",
      "-p",
      "4"
    ],
    "mountPoints": [
      {

      }
    ],
    "volumes": [
    
      
    ]
  }
}

Solution

  • I made a walkthrough here: https://github.com/jolespin/veba/blob/main/walkthroughs/adapting_commands_for_aws.md

    Steps:

    1. Set up AWS infrastructure
    2. Create and register a job definition
    3. Submit job definition

    1. Set up AWS infrastructure

    Out of scope for this tutorial, but essentially you need to do the following (a rough CLI sketch of the last two items follows the list):

    • Set up AWS EFS (Elastic File System) via Terraform to read/write/mount data
    • Compile database in EFS
    • Create compute environment
    • Create job queue linked to compute environment
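
    For orientation, here is a rough, untested sketch of the compute-environment and job-queue steps using the AWS CLI. The environment and queue names, subnet and security-group IDs, and the service-role ARN are all placeholders you would swap for your own:

    # Hypothetical managed Fargate compute environment (IDs and role ARN are placeholders)
    aws batch create-compute-environment \
        --compute-environment-name veba-fargate-ce \
        --type MANAGED \
        --state ENABLED \
        --compute-resources '{"type": "FARGATE", "maxvCpus": 64, "subnets": ["subnet-xxx"], "securityGroupIds": ["sg-xxx"]}' \
        --service-role arn:aws:iam::xxx:role/AWSBatchServiceRole

    # Hypothetical job queue linked to the compute environment above
    aws batch create-job-queue \
        --job-queue-name veba-queue \
        --state ENABLED \
        --priority 1 \
        --compute-environment-order order=1,computeEnvironment=veba-fargate-ce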

    2. Create and register a job definition

    Once the job queue is properly set up, the next step is to create a job definition and submit it to the queue.

    The preferred way to submit jobs to AWS Batch is with a JSON job definition, in this case running on Fargate.

    Here is a template you can use for a job definition.

    This job definition pulls the jolespin/veba_preprocess Docker image and mounts EFS directories to volumes within the Docker container. The actual job runs the preprocess.py module of VEBA for a sample called S1.

    {
      "jobDefinitionName": "preprocess__S1",
      "type": "container",
      "containerProperties": {
        "image": "jolespin/veba_preprocess:1.4.1",
        "command": [
          "preprocess.py",
          "-1",
          "/volumes/input/Fastq/S1_1.fastq.gz",
          "-2",
          "/volumes/input/Fastq/S1_2.fastq.gz",
          "-n",
          "1",
          "-o",
          "/volumes/output/veba_output/preprocess",
          "-p",
          "16"
          "-x",
          "/volumes/database/Contamination/chm13v2.0/chm13v2.0"
        ],
        "jobRoleArn": "arn:aws:iam::xxx:role/ecsTaskExecutionRole",
        "executionRoleArn": "arn:aws:iam::xxx:role/ecsTaskExecutionRole",
        "volumes": [
          {
            "name": "efs-volume-database",
            "efsVolumeConfiguration": {
              "fileSystemId": "fs-xxx",
              "transitEncryption": "ENABLED",
              "rootDirectory": "databases/veba/VDB_v5.1/"
            }
          },
          {
            "name": "efs-volume-input",
            "efsVolumeConfiguration": {
              "fileSystemId": "fs-xxx",
              "transitEncryption": "ENABLED",
              "rootDirectory": "path/to/efs/input/"
            }
          },
          {
            "name": "efs-volume-output",
            "efsVolumeConfiguration": {
              "fileSystemId": "fs-xxx",
              "transitEncryption": "ENABLED",
              "rootDirectory": "path/to/efs/output/"
            }
          }
        ],
        "mountPoints": [
          {
            "sourceVolume": "efs-volume-database",
            "containerPath": "/volumes/database",
            "readOnly": true
          },
          {
            "sourceVolume": "efs-volume-input",
            "containerPath": "/volumes/input",
            "readOnly": true
          },
          {
            "sourceVolume": "efs-volume-output",
            "containerPath": "/volumes/output",
            "readOnly": false
          }
        ],
        "environment": [],
        "ulimits": [],
        "resourceRequirements": [
          {
            "value": "16.0",
            "type": "VCPU"
          },
          {
            "value": "8000",
            "type": "MEMORY"
          }
        ],
        "networkConfiguration": {
          "assignPublicIp": "ENABLED"
        },
        "fargatePlatformConfiguration": {
          "platformVersion": "LATEST"
        },
        "ephemeralStorage": {
          "sizeInGiB": 40
        }
      },
      "tags": {
        "Name": "preprocess__S1"
      },
      "platformCapabilities": [
        "FARGATE"
      ]
    }
    

    Now register the job definition:

    FILE=/path/to/preprocess/S1.json
    aws batch register-job-definition --cli-input-json file://${FILE}
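
    To confirm the registration went through, you can describe the job definition back with the AWS CLI:

    # Verify the job definition is registered and check its revision number
    aws batch describe-job-definitions --job-definition-name preprocess__S1 --status ACTIVE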
    

    3. Submit job definition

    The next step is to submit the job to the queue, where JOB is the job definition name (here, preprocess__S1) and QUEUE is the name of your job queue:

    aws batch submit-job --job-definition ${JOB} --job-name ${JOB} --job-queue ${QUEUE}
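
    If you want to keep an eye on the job afterwards, the usual AWS CLI calls apply (JOB_ID below stands for the jobId that submit-job prints):

    # Check the status of a submitted job
    aws batch describe-jobs --jobs ${JOB_ID} --query 'jobs[0].status'

    # Or list everything currently running in the queue
    aws batch list-jobs --job-queue ${QUEUE} --job-status RUNNING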