Tags: docker, google-cloud-platform, nextflow

Nextflow on GCP - waiting on container error


I'm running a pipeline using Nextflow on Google Batch. However, I'm getting the following error:

ERROR ~ Error executing process > 'PLANT:NLREXPRESS (All_Candidate_Soybean_Prots_Simplified_Sorted)'

Caused by:
  Process `PLANT:NLREXPRESS (All_Candidate_Soybean_Prots_Simplified_Sorted)` terminated with an error exit status (null)

Command executed:

  mkdir output
  nlrexpress.py \
        --input All_Candidate_Soybean_Prots_Simplified_Sorted.fasta \
        --outdir ./output \
        --module all

  mv output/*.short.output.txt ./

Command exit status:
  null

Command output:
  15/06/2023 15:36:31:  ############ NLRexpress started ############
  15/06/2023 15:36:31:  Input FASTA: All_Candidate_Soybean_Prots_Simplified_Sorted.fasta
  15/06/2023 15:36:31:  Checking FASTA file - started
  15/06/2023 15:36:31:  Checking FASTA file - done
  15/06/2023 15:36:31:  Running JackHMMER - started

Command error:
  time="2023-06-15T15:39:22Z" level=error msg="error waiting for container: "

Work dir:
  gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1

Tip: view the complete command output by changing to the process work dir and entering the command `cat .command.out`

 -- Check '.nextflow.log' file for details

Here is the module's .nf file:

process NLREXPRESS {
  tag "$sample_id"
  maxForks 1
  container = 'dthorbur1990/nlrexpress:latest'

  cpus { 4 * task.attempt }
  memory { 12.GB * task.attempt }
  disk "15.GB"

  publishDir(
    path: "${params.PlantDir}",
    mode: 'copy',
  )
  
  input:
      tuple val(sample_id), path(peptides)

  output:
      path "*.short.output.txt", emit: nlre_out

  script:
  """
  mkdir output
  nlrexpress.py \\
        --input ${peptides} \\
        --outdir ./output \\
        --module ${params.NE_Modules}

  mv output/*.short.output.txt ./
  """
}

The process ran without error when I ran it locally, and I have rebuilt the container and confirmed it works as intended.

What confuses me is that the workDir doesn't contain either of the .command.{out,err} files, suggesting (to me at least) that the process isn't actually running. But the Command output section of the error message shows the correct first few lines of the tool's output.

Here is the workDir:

gsutil ls gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.begin
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.run
gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.sh

And here is the end of the log file regarding the NLREXPRESS module:

All_Candidate_Soybean_Prots_Simplified_Sorted)","q3Label":"PLANT:NLREXPRESS (All_Candidate_Soybean_Prots_Simplified_Sorted)"},"writes":null},{"cpuUsage":null,"process":"ORIENTATION","mem":null,"memUsage":null,"timeUsage":null,"vmem":null,"reads":null,"cpu":null,"time":null,"writes":null}]

I'm at a loss. I've tried increasing memory, but that doesn't seem to have worked. Any ideas? Happy to add the .nextflow.log file if that would be helpful.


Solution

  • I'm not sure if I have an answer for you, but I think this behavior might have something to do with how Nextflow runs the job. If you look at the end of the nxf_main function in the .command.run script, you'll see something like:

    nxf_main() {
    
        ...
    
        set +e
        ctmp=$(set +u; nxf_mktemp /dev/shm 2>/dev/null || nxf_mktemp $TMPDIR)
        local cout=$ctmp/.command.out; mkfifo $cout
        local cerr=$ctmp/.command.err; mkfifo $cerr
        tee .command.out < $cout &
        tee1=$!
        tee .command.err < $cerr >&2 &
        tee2=$!
        ( nxf_launch ) >$cout 2>$cerr &
        pid=$!
        wait $pid || nxf_main_ret=$?
        wait $tee1 $tee2
        nxf_unstage
    }
    

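    (You can read your own copy of the full .command.run script straight from the task work dir shown in the question, for example:)

    gsutil cat gs://rb-rnaseq/workDir/6e/090e663de08b69ce6c9506dc4975c1/.command.run | less
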
    When errexit is enabled (set -e), any command that returns a non-zero exit status immediately terminates the script; set +e explicitly disables that behavior. So if the nxf_mktemp or mkfifo calls above fail (for example, because there is no room in /dev/shm), the script simply carries on, which means .command.out and .command.err may never be created even though the Docker container is still run (via nxf_launch).
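
    As a rough illustration, here is a minimal sketch (not Nextflow's real launcher code) of that failure mode: the temp/FIFO setup fails, nothing aborts, and the payload still runs with its output going uncaptured.

    #!/bin/bash
    # Minimal sketch (not Nextflow's real launcher): with `set +e`, a failed
    # temp/FIFO setup does not stop the script, so .command.out/.err are never
    # produced even though the task payload still runs.

    set +e                                     # keep going on non-zero exit codes

    # Simulate nxf_mktemp failing (e.g. /dev/shm too small or unwritable):
    ctmp=$(mktemp -d /nonexistent/XXXXXX 2>/dev/null)   # fails -> ctmp is empty

    # FIFO creation then fails as well, but the script does not abort here.
    mkfifo "${ctmp:-/nonexistent}/.command.out" 2>/dev/null \
        || echo "FIFO setup failed: no .command.out will be captured"

    # The payload (nxf_launch, i.e. the container, in the real script) still runs.
    echo "task payload runs anyway"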

    So I wonder if there is a problem with the size of your /dev/shm? You could try using the docker.runOptions setting to bump the --shm-size. For example, by adding the following to your nextflow.config:

    docker {
        enabled = true
        runOptions = '--shm-size 2g'
    }
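
    If you want to sanity-check that setting locally first, something like the following should report the shared-memory size the container actually gets (this assumes df is available inside the image, which it usually is for Debian/Ubuntu-based images):

    # Hedged local check: run the same image with the bumped shm size and see
    # what /dev/shm reports (assumes `df` exists inside the image).
    docker run --rm --shm-size 2g dthorbur1990/nlrexpress:latest df -h /dev/shm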