Nextflow: error while running GATK - the command runs fine on the command line without -Xmx options

I do not provide any -Xmx or similar JVM settings when running gatk. I run it on the login node of an HPC server.

I have params.outdir_fastp="/sc/arion/projects/name/user/output_pipeline_nextflow/test_tiny_datasets/trimmed"


params.rawFiles = "/sc/arion/projects/user/name/tiny/tumor/*_R{1,2}_xxx.fastq.gz"

params.hg38genome ="/sc/arion/projects/user/name/reference_genome/neisseria_meningitidis/NM.fasta"

include { FASTP} from './'
include {bwa_index} from './'
include { align_bwa_mem} from './'
include { gatk_markduplicates} from './'

workflow {

        read_pairs_ch = Channel.fromFilePairs( params.rawFiles )

I have

process gatk_markduplicates {

    debug true

    publishDir params.gatk_mark_duplicates, mode: "copy"

    input:
        tuple val(sample_id), path(sorted_bam)

    output:
        tuple val(sample_id), path("${sample_id}.dedup.sorted.bam")
        tuple val(sample_id), path("${sample_id}.markdup.metrics.txt")

    script:
        """
        echo "$sample_id ${params.outdir_bwa_mem}/${sorted_bam}\n"
        ml gatk/

        gatk MarkDuplicates -I ${params.outdir_bwa_mem}/${sorted_bam} \\
        -O ${sample_id}.dedup.sorted.bam \\
        -M ${sample_id}.markdup.metrics.txt
        """
}


I get the following error:

Error executing process > 'gatk_markduplicates (2)'

Caused by:
  Process `gatk_markduplicates (2)` terminated with an error exit status (1)

Command executed:

  echo "tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
  "
  ml gatk/
          gatk MarkDuplicates -I /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam \
          -O tiny_t_L007.dedup.sorted.bam \
          -M tiny_t_L007.markdup.metrics.txt

Command exit status:

Command output:
  tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
  # There is insufficient memory for the Java Runtime Environment to continue.
  # Cannot create GC thread. Out of system resources.
  # An error report file with more information is saved as:
  # hs_err_pid148553.log

Command error:
  tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
  The following have been reloaded with a version change:
    1) java/11.0.2 => java/1.8.0_211
  Using GATK jar /hpc/packages/minerva-centos7/gatk/
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /hpc/packages/minerva-centos7/gatk/ MarkDuplicates -I /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam -O tiny_t_L007.dedup.sorted.bam -M tiny_t_L007.markdup.metrics.txt
  # There is insufficient memory for the Java Runtime Environment to continue.
  # Cannot create GC thread. Out of system resources.
  # An error report file with more information is saved as:
  # hs_err_pid148553.log

Work dir:

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

The following runs fine:

gatk MarkDuplicates -I /sc/arion/projects/name/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L003.sorted.bam         -O tiny_t_L003.dedup.sorted.bam         -M tiny_t_L003.markdup.metrics.txt

The following also works fine:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /hpc/packages/minerva-centos7/gatk/ MarkDuplicates -I /sc/arion/projects/username/path/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L001.sorted.bam -O tiny_t_L001.dedup.sorted.bam -M tiny_t_L001.markdup.metrics.txt


  • I think you're just out of memory here. If you haven't set the memory directive, your job is likely being submitted with the default memory allocation determined by your job scheduler. This is often a low number (e.g. 1 GB), which is less than the JRE requires to run GATK. Setting the directive and passing the matching -Xmx value to the JVM should help. For a coordinate-sorted BAM, the --ASSUME_SORT_ORDER option might also be helpful, for example:

    process gatk_markduplicates {

        tag { sample_id }

        publishDir params.gatk_mark_duplicates, mode: "copy"

        module 'gatk/'

        memory 8.GB

        input:
        tuple val(sample_id), path(sorted_bam)

        output:
        tuple val(sample_id), path("${sample_id}.dedup.sorted.bam"), emit: bam
        tuple val(sample_id), path("${sample_id}.markdup.metrics.txt"), emit: metrics

        script:
        // Cap the JVM heap at the memory requested for this task, if any
        def avail_mem = task.memory ? task.memory.toGiga() : 0
        def java_options = [
            avail_mem ? "-Xmx${avail_mem}G" : "",
        ]
        """
        gatk \\
            --java-options "${java_options.join(' ')}" \\
        MarkDuplicates \\
            --INPUT "${sorted_bam}" \\
            --METRICS_FILE "${sample_id}.markdup.metrics.txt" \\
            --OUTPUT "${sample_id}.dedup.sorted.bam" \\
            --ASSUME_SORT_ORDER coordinate
        """
    }

    Note that your input block will ensure that the sorted BAM is staged into the process working directory. Using an absolute path here (i.e. params.outdir_bwa_mem) to access a different file outside of this directory with the same file name is not what you want.
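
    The memory request does not have to live in the process definition. As a minimal sketch, assuming a nextflow.config file in the launch directory and that 8 GB (the example value used in the process above) is enough for your BAMs, the same directive could be set with a process selector:

    ```groovy
    // nextflow.config (assumed location: the directory the pipeline is launched from)
    process {
        // Applies only to the process named gatk_markduplicates
        withName: 'gatk_markduplicates' {
            // Example value only; adjust to the size of your BAM files
            memory = 8.GB
        }
    }
    ```

    With the directive set this way, task.memory inside the process picks up the configured value, so the -Xmx option follows whatever is set in the config file.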