I wanted to test Nextflow error handling with the AWS Batch executor. I used stress to fill 20 GB of memory while initially allocating only 12 GB, and applied the standard error strategy from the manual.
#!/usr/bin/env nextflow
nextflow.enable.dsl=2

process test {
    cpus 2
    memory { 12.GB * task.attempt }   // scale the request with each attempt
    errorStrategy { task.exitStatus in 137..140 ? 'retry' : 'terminate' }
    maxRetries 3

    script:
    """
    stress -c 2 -t 60 --vm 20 --vm-bytes 1024M
    """
}

workflow {
    test()
}
Although the error message is:
Caused by:
Essential container in task exited - OutOfMemoryError: Container killed due to memory usage
...the exit status is 8 (rather than 137..140, so resources are not adjusted):
Command exit status:
8
What might be the problem here? Thanks!
The problem might be that you're expecting a certain exit status (128 + 9 = 137, i.e. killed by SIGKILL), but there are really no guarantees in life. The reason you get an exit status of 8 here (or any small integer, really) has to do with how stress works:
It is a single file, stress.c, whose internal organization is in essence a loop that forks worker processes and then waits for them to either complete normally or exit with an error. While waiting for the workers to exit, a return value (initialized with retval = 0) is incremented each time a worker returns an error, and the program finally exits with that value, which becomes the exit status. This guarantees a non-zero exit status when even a single worker fails, but note that it is the workers that get OOM-killed, not the parent stress process, so the 128 + signal convention never surfaces in the parent's exit status. An exit status of 8 simply suggests that eight of the twenty --vm workers hit an error.
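The pattern looks roughly like this (a simplified sketch, not the actual stress.c source; worker bodies and option handling are omitted):

    #include <sys/wait.h>
    #include <unistd.h>

    int main(void) {
        int retval = 0;          /* initialized to 0, as described above */
        int status;

        /* fork the requested number of workers */
        for (int i = 0; i < 20; i++) {
            if (fork() == 0) {
                /* worker: allocate/touch memory, then exit 0 on success;
                   an OOM-killed worker never reaches this _exit */
                _exit(0);
            }
        }

        /* reap workers; count every one that did not exit cleanly */
        while (wait(&status) > 0) {
            if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                retval++;
        }

        /* exit status == number of failed workers, e.g. 8 */
        return retval;
    }

So the exit status reports how many workers failed, not which signal killed them.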
Ultimately, a decision on the most appropriate errorStrategy needs to be made for each process. For the command above (i.e. stress), you may simply want errorStrategy { 'retry' } for testing. In production, however, I find that a dynamic retry with exponential backoff works quite well, since you usually want to 'retry' anyway. To make this the default errorStrategy, just add this to your nextflow.config:
process {
    errorStrategy = {
        sleep( Math.pow( 2, task.attempt ) * 150 as long )
        return 'retry'
    }
    maxRetries = 3
    ...
}
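With maxRetries = 3, the closure above sleeps for roughly 300 ms, 600 ms, and 1200 ms before the second, third, and fourth attempts (Math.pow(2, task.attempt) * 150, evaluated with task.attempt equal to 1, 2, and 3 at the time of each failure), so transient failures get progressively more breathing room before the run gives up.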