
nextflow: error while running GATK - command runs fine on command line without -Xmx options


I do not provide any -Xmx or similar JVM settings when running GATK. I run my main.nf on the login node of an HPC server.

I have main.nf:

params.outdir_fastp="/sc/arion/projects/name/user/output_pipeline_nextflow/test_tiny_datasets/trimmed"

params.outdir_index="/sc/arion/projects/path/name/output_pipeline_nextflow/test_tiny_datasets/bwa_index"

params.rawFiles = "/sc/arion/projects/user/name/tiny/tumor/*_R{1,2}_xxx.fastq.gz"
params.outdir_bwa_mem="/sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem"

params.hg38genome ="/sc/arion/projects/user/name/reference_genome/neisseria_meningitidis/NM.fasta"
params.gatk_mark_duplicates="/sc/arion/projects/username/path/output_pipeline_nextflow/test_tiny_datasets/gatk_mark_duplicates"

include { FASTP } from './fastp_process.nf'
include { bwa_index } from './index_process.nf'
include { align_bwa_mem } from './bwamem_process.nf'
include { gatk_markduplicates } from './gatk_markduplicates_process.nf'

workflow {

        read_pairs_ch = Channel.fromFilePairs( params.rawFiles )
        FASTP(read_pairs_ch)
        bwa_index(params.hg38genome)
        align_bwa_mem(FASTP.out.reads, bwa_index.out)
        gatk_markduplicates(align_bwa_mem.out.sorted_bams)
}

I have gatk_markduplicates_process.nf:

process gatk_markduplicates {

    debug true

    publishDir params.gatk_mark_duplicates, mode: "copy"

    input:
        tuple val(sample_id), path(sorted_bam)

    output:
        tuple val(sample_id), path("${sample_id}.dedup.sorted.bam")
        tuple val(sample_id), path("${sample_id}.markdup.metrics.txt")

    script:
    """
    echo "$sample_id ${params.outdir_bwa_mem}/${sorted_bam}\n"
    ml gatk/4.1.3.0

    gatk MarkDuplicates -I ${params.outdir_bwa_mem}/${sorted_bam} \\
        -O ${sample_id}.dedup.sorted.bam \\
        -M ${sample_id}.markdup.metrics.txt
    """
}

I get this error:

Error executing process > 'gatk_markduplicates (2)'

Caused by:
  Process `gatk_markduplicates (2)` terminated with an error exit status (1)

Command executed:

  echo "tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
  " An e
  # hs_eml gatk/4.1.3.0
  
          gatk MarkDuplicates -I /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam \
          -O tiny_t_L007.dedup.sorted.bam \
          -M tiny_t_L007.markdup.metrics.txt

Command exit status:
  1

Command output:
  tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
  
  #
  # There is insufficient memory for the Java Runtime Environment to continue.
  # Cannot create GC thread. Out of system resources.
  # An error report file with more information is saved as:
  # hs_err_pid148553.log

Command error:
  tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
  
  
  The following have been reloaded with a version change:
    1) java/11.0.2 => java/1.8.0_211
  
  Using GATK jar /hpc/packages/minerva-centos7/gatk/4.1.3.0/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
  Running:
      java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /hpc/packages/minerva-centos7/gatk/4.1.3.0/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar MarkDuplicates -I /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam -O tiny_t_L007.dedup.sorted.bam -M tiny_t_L007.markdup.metrics.txt
  #
  # There is insufficient memory for the Java Runtime Environment to continue.
  # Cannot create GC thread. Out of system resources.
  # An error report file with more information is saved as:
  # hs_err_pid148553.log

Work dir:
  /sc/arion/projects/user/name/nextflow_pipeline/scripts_pipeline/work/e1/0b2f8095e9b0caf074691786ed4c1e

Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line

The following runs fine:

gatk MarkDuplicates -I /sc/arion/projects/name/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L003.sorted.bam         -O tiny_t_L003.dedup.sorted.bam         -M tiny_t_L003.markdup.metrics.txt

Or the following also works fine:

java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /hpc/packages/minerva-centos7/gatk/4.1.3.0/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar MarkDuplicates -I /sc/arion/projects/username/path/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L001.sorted.bam -O tiny_t_L001.dedup.sorted.bam -M tiny_t_L001.markdup.metrics.txt

Solution

  • I think you're just out of memory here. If you haven't specified how much memory the job is allowed to use via the `memory` directive, your job is likely being submitted with whatever default your job scheduler assigns. That default is often low (e.g. 1 GB) and less than what the JRE requires to run GATK. Since your BAM is coordinate sorted, the --ASSUME_SORT_ORDER option might also be helpful here. For example (a nextflow.config sketch for setting the memory outside the process definition follows at the end of this answer):

    process gatk_markduplicates {
    
        tag { sample_id }
    
        publishDir params.gatk_mark_duplicates, mode: "copy"
    
        module 'gatk/4.1.3.0'
        memory 8.GB    
    
        input:
        tuple val(sample_id), path(sorted_bam) 
            
        output:
        tuple val(sample_id), path("${sample_id}.dedup.sorted.bam"), emit: bam
        tuple val(sample_id), path("${sample_id}.markdup.metrics.txt"), emit: metrics
    
        script:
        def avail_mem = task.memory ? task.memory.toGiga() : 0
        def java_options = [
            avail_mem ? "-Xmx${avail_mem}G" : "",
            "-Djava.io.tmpdir='\${PWD}/tmp'",
            "-XX:+UseSerialGC",
        ]
    
        """
        gatk \\
            --java-options "${java_options.join(' ')}" \\
        MarkDuplicates \\
            --INPUT "${sorted_bam}" \\
            --METRICS_FILE "${sample_id}.markdup.metrics.txt" \\
            --OUTPUT "${sample_id}.dedup.sorted.bam" \\
            --ASSUME_SORT_ORDER coordinate
        """
    }
    

    Note that your input block will ensure that the sorted BAM is staged into the process working directory, so the script can refer to it simply as ${sorted_bam}. Using an absolute path (i.e. prefixing with params.outdir_bwa_mem) to reach a file outside of that directory, even one with the same name, is not what you want.
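
    If you want to confirm what that channel actually carries before it reaches gatk_markduplicates, a quick, purely diagnostic check is to add a view() call in your workflow block. This is a minimal sketch based on the workflow in your question; the printed paths in the comment are illustrative only:

    workflow {

        read_pairs_ch = Channel.fromFilePairs( params.rawFiles )
        FASTP(read_pairs_ch)
        bwa_index(params.hg38genome)
        align_bwa_mem(FASTP.out.reads, bwa_index.out)

        // prints items like: [tiny_t_L007, /path/to/work/xx/xxxx/tiny_t_L007.sorted.bam]
        // i.e. the channel already carries the staged file itself, so there is no need
        // to rebuild its location from params.outdir_bwa_mem in the downstream script
        align_bwa_mem.out.sorted_bams.view()

        gatk_markduplicates(align_bwa_mem.out.sorted_bams)
    }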
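
    Also, if you'd rather keep resource settings out of the process definition, the same memory directive can be set in your nextflow.config. A minimal sketch, assuming an LSF executor (substitute whatever scheduler your cluster uses; the 8 GB figure is likewise just a starting point, not something taken from your logs):

    // nextflow.config
    process {
        executor = 'lsf'

        withName: 'gatk_markduplicates' {
            memory = 8.GB
            cpus   = 1
        }
    }

    Either way, the important part is that the submitted job requests enough memory for the JVM rather than relying on the scheduler's default.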