Tags: amazon-s3, aws-fargate, snakemake, aws-batch

Snakemake running as an AWS Batch or AWS Fargate task raises MissingInputException on inputs stored in an S3 bucket


We have a Dockerized Snakemake pipeline with the input data stored in an S3 bucket, snakemake-bucket:

Snakefile:

rule bwa_map:
    input:
        "data/genome.fa"
    output:
        "results/mapped/A.bam"
    shell:
        "cat {input} > {output}"

Dockerfile:

FROM snakemake/snakemake:v8.15.2
RUN mamba install -c conda-forge -c bioconda snakemake-storage-plugin-s3
WORKDIR /app
COPY ./workflow ./workflow
ENV PYTHONWARNINGS="ignore:Unverified HTTPS request"
CMD ["snakemake","--default-storage-provider","s3","--default-storage-prefix","s3://snakemake-bucket","results/mapped/A.bam","--cores","1","--verbose","--printshellcmds"]

When we run the container with the following command, it downloads the input file, runs the pipeline, and stores the output in the bucket successfully:

docker run -it -e SNAKEMAKE_STORAGE_S3_ACCESS_KEY=**** -e SNAKEMAKE_STORAGE_S3_SECRET_KEY=**** our-snakemake:v0.0.10

However, when we deploy it as an AWS Batch job or AWS Fargate task, it fails immediately with the following error:

Assuming unrestricted shared filesystem usage.
Building DAG of jobs...
Full Traceback (most recent call last):
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/cli.py", line 2103, in args_to_api
    dag_api.execute_workflow(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/api.py", line 594, in execute_workflow
    workflow.execute(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1081, in execute
    self._build_dag()
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/workflow.py", line 1037, in _build_dag
    async_run(self.dag.init())
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/common/__init__.py", line 94, in async_run
    return asyncio.run(coroutine)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 183, in init
    job = await self.update(
          ^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 1013, in update
    raise exceptions[0]
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 970, in update
    await self.update_(
  File "/opt/conda/envs/snakemake/lib/python3.12/site-packages/snakemake/dag.py", line 1137, in update_
    raise MissingInputException(job, missing_input)
snakemake.exceptions.MissingInputException: Missing input files for rule bwa_map:
    output: results/mapped/A.bam
    wildcards: sample=A
    affected files:
        s3://snakemake-bucket/data/genome.fa (storage)

MissingInputException in rule bwa_map in file /app/workflow/Snakefile, line 10:
Missing input files for rule bwa_map:
    output: results/mapped/A.bam
    wildcards: sample=A
    affected files:
        s3://snakemake-bucket/data/genome.fa (storage)

Any ideas would be appreciated. What we have checked so far:

  • The image works fine locally and also on an external VPS, but it does not work on AWS Fargate.
  • The file in the bucket is accessible and downloadable from inside the container on the AWS task, checked by:
    /opt/conda/envs/snakemake/bin/python -c "import os; import boto3; s3 = boto3.resource('s3', aws_access_key_id=os.environ.get('SNAKEMAKE_STORAGE_S3_ACCESS_KEY'), aws_secret_access_key=os.environ.get('SNAKEMAKE_STORAGE_S3_SECRET_KEY')); my_bucket = s3.Bucket('snakemake-bucket'); [my_bucket.download_file(d.key, d.key) for d in my_bucket.objects.all()]; print(os.listdir())"
  • We added --use-conda and --software-deployment-method conda; it made no difference.
  • The environment variables passed to the containers are the same; only some AWS_* and ECS_* related variables are added in the AWS environment (a quick way to list them inside the task is sketched after this list).
  • Mounting a volume at .snakemake did not change the outcome.
  • Kernel versions: AWS 5.10.219-208.866.amzn2.x86_64, local 5.15.0-97-generic.
  • Changing the Snakefile to use Storage Support Within Workflow has no effect.
  • The Job/task runs the pipeline successfully with local input/output files.
  • Snakemake Docker tag: snakemake/snakemake:v8.15.2
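
A sketch of the environment-variable check referenced above, assuming grep and sed are available inside the container (only the variable names matter here, so the values are masked):

# List the AWS/ECS/Snakemake-related variables inside the running task,
# masking their values so no credentials end up in the task logs.
env | grep -E '^(AWS_|ECS_|SNAKEMAKE_STORAGE_S3_)' | sed 's/=.*/=****/' | sort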

Solution

  • It seems AWS Fargate sets some environment variables, including AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, and when that variable is present boto3 decides it needs AWS_SECRET_ACCESS_KEY and AWS_ACCESS_KEY_ID in addition to SNAKEMAKE_STORAGE_S3_SECRET_KEY and SNAKEMAKE_STORAGE_S3_ACCESS_KEY. If you want to run Snakemake on AWS Fargate, you either have to set all four variables, or unset AWS_CONTAINER_CREDENTIALS_RELATIVE_URI in your Docker entrypoint.sh.
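
A minimal entrypoint.sh sketch illustrating the unset approach, with the all-four-variables approach shown as a commented-out alternative. This is an assumption, not the original setup: the file name, the variable mapping, and wiring it in via COPY entrypoint.sh plus ENTRYPOINT in place of the Dockerfile's CMD would all need to be adapted to your image.

#!/usr/bin/env bash
# entrypoint.sh -- hypothetical sketch, not the author's original file.
set -euo pipefail

# Approach A: drop the variable Fargate injects, so boto3 falls back to the
# explicitly provided access keys.
unset AWS_CONTAINER_CREDENTIALS_RELATIVE_URI

# Approach B (alternative): keep the variable but also set the standard AWS
# credential variables, so that all four are defined as described above.
# export AWS_ACCESS_KEY_ID="${SNAKEMAKE_STORAGE_S3_ACCESS_KEY}"
# export AWS_SECRET_ACCESS_KEY="${SNAKEMAKE_STORAGE_S3_SECRET_KEY}"

# Same invocation as the Dockerfile CMD.
exec snakemake \
    --default-storage-provider s3 \
    --default-storage-prefix s3://snakemake-bucket \
    results/mapped/A.bam \
    --cores 1 --verbose --printshellcmds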