Tags: aws-batch, nextflow

Nextflow: How to deal with out of memory error?


I wanted to test Nextflow error handling with the AWS Batch executor. I used `stress` to fill 20 GB of memory while initially allocating only 12 GB, and applied the standard error strategy (as in the manual).

#!/usr/bin/env nextflow

nextflow.enable.dsl=2

process test {

    cpus 2
    memory { 12.GB * task.attempt }
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    """
    stress -c 2 -t 60 --vm 20 --vm-bytes 1024M
    """
}

workflow {
  test()
}

Although the error message is:

Caused by:
  Essential container in task exited - OutOfMemoryError: Container killed due to memory usage

…the exit status is 8 (rather than something in 137..140, so resources are not adjusted):

Command exit status:
  8

What might be the problem here? Thanks!


Solution

  • The problem might be that you're expecting a particular exit status (128 + 9 = 137), but there are really no guarantees in life. The reason you get an exit status of 8 here (or any int, really) has to do with how stress works:

    It is a single file called stress.c whose internal organization is in essence a loop that forks worker processes and then waits for them to either complete normally or exit with an error.

    So while waiting for the workers to exit, a return value (initialized with `retval = 0`) is incremented each time a worker returns an error. The program then exits with this value, which becomes the exit status. This guarantees a non-zero exit status whenever any single worker returns an error, but the value reflects the *number* of failed workers, not *why* they failed.
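    Both behaviors can be sketched in shell. This is a simplified model of stress's wait loop, not the actual stress.c code, and the worker commands are made up for illustration:

    ```shell
    #!/bin/sh
    # A child killed with SIGKILL -- as the kernel OOM killer does --
    # reports 128 + 9 = 137 to its parent:
    sleep 10 &
    kill -9 $!
    wait $!
    oom_status=$?
    echo "OOM-killed worker status: $oom_status"    # 137

    # But a stress-style parent does not propagate that status. It only
    # *counts* failed workers and exits with the count:
    retval=0
    pids=""
    for i in 1 2 3 4; do
        ( test "$i" -eq 4 ) &      # hypothetical workers: 1-3 fail, 4 succeeds
        pids="$pids $!"
    done
    for pid in $pids; do
        wait "$pid" || retval=$((retval + 1))    # increment per failing worker
    done
    echo "stress-style exit status: $retval"     # 3, whatever the failure reason
    ```

    So if 8 of the 20 `--vm` workers were OOM-killed, the parent would exit with 8, and the `task.exitStatus in 137..140` check would never match.
    
    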

    Ultimately, a decision on the most appropriate errorStrategy needs to be made for each process. For the command above (i.e. stress), you may simply want errorStrategy { 'retry' } for testing. In production, however, I find using a dynamic retry with exponential backoff works quite well since we would want to 'retry' usually anyway. To make this the 'default' errorStrategy, just add this to your nextflow.config:

    process {

      errorStrategy = {
        sleep( Math.pow( 2, task.attempt ) * 150 as long )
        return 'retry'
      }
      maxRetries = 3
      ...