I do not pass any -Xmx (or similar) JVM settings when running gatk. I run my main.nf on the login node of an HPC server.
I have main.nf:
params.outdir_fastp="/sc/arion/projects/name/user/output_pipeline_nextflow/test_tiny_datasets/trimmed"
params.outdir_index="/sc/arion/projects/path/name/output_pipeline_nextflow/test_tiny_datasets/bwa_index"
params.rawFiles = "/sc/arion/projects/user/name/tiny/tumor/*_R{1,2}_xxx.fastq.gz"
params.outdir_bwa_mem="/sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem"
params.hg38genome ="/sc/arion/projects/user/name/reference_genome/neisseria_meningitidis/NM.fasta"
params.gatk_mark_duplicates="/sc/arion/projects/username/path/output_pipeline_nextflow/test_tiny_datasets/gatk_mark_duplicates"
include { FASTP} from './fastp_process.nf'
include {bwa_index} from './index_process.nf'
include { align_bwa_mem} from './bwamem_process.nf'
include { gatk_markduplicates} from './gatk_markduplicates_process.nf'
workflow {
    read_pairs_ch = Channel.fromFilePairs( params.rawFiles )
    FASTP(read_pairs_ch)
    bwa_index(params.hg38genome)
    align_bwa_mem(FASTP.out.reads,bwa_index.out)
    gatk_markduplicates(align_bwa_mem.out.sorted_bams)
}
I have gatk_markduplicates_process.nf:
process gatk_markduplicates {
    debug true
    publishDir params.gatk_mark_duplicates , mode:"copy"
    input:
    tuple val(sample_id), path(sorted_bam)
    output:
    tuple val(sample_id),path("${sample_id}.dedup.sorted.bam")
    tuple val(sample_id),path("${sample_id}.markdup.metrics.txt")
    script:
    """
    echo "$sample_id ${params.outdir_bwa_mem}/${sorted_bam}\n"
    ml gatk/4.1.3.0
    gatk MarkDuplicates -I ${params.outdir_bwa_mem}/${sorted_bam} \\
    -O ${sample_id}.dedup.sorted.bam \\
    -M ${sample_id}.markdup.metrics.txt
    """
}
I get the following error:
Error executing process > 'gatk_markduplicates (2)'
Caused by:
Process `gatk_markduplicates (2)` terminated with an error exit status (1)
Command executed:
echo "tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
"
ml gatk/4.1.3.0
gatk MarkDuplicates -I /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam \
-O tiny_t_L007.dedup.sorted.bam \
-M tiny_t_L007.markdup.metrics.txt
Command exit status:
1
Command output:
tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.
# An error report file with more information is saved as:
# hs_err_pid148553.log
Command error:
tiny_t_L007 /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam
The following have been reloaded with a version change:
1) java/11.0.2 => java/1.8.0_211
Using GATK jar /hpc/packages/minerva-centos7/gatk/4.1.3.0/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /hpc/packages/minerva-centos7/gatk/4.1.3.0/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar MarkDuplicates -I /sc/arion/projects/user/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L007.sorted.bam -O tiny_t_L007.dedup.sorted.bam -M tiny_t_L007.markdup.metrics.txt
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create GC thread. Out of system resources.
# An error report file with more information is saved as:
# hs_err_pid148553.log
Work dir:
/sc/arion/projects/user/name/nextflow_pipeline/scripts_pipeline/work/e1/0b2f8095e9b0caf074691786ed4c1e
Tip: when you have fixed the problem you can continue the execution adding the option `-resume` to the run command line
The following runs fine:
gatk MarkDuplicates -I /sc/arion/projects/name/name/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L003.sorted.bam -O tiny_t_L003.dedup.sorted.bam -M tiny_t_L003.markdup.metrics.txt
The following also works fine:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /hpc/packages/minerva-centos7/gatk/4.1.3.0/gatk-4.1.3.0/gatk-package-4.1.3.0-local.jar MarkDuplicates -I /sc/arion/projects/username/path/output_pipeline_nextflow/test_tiny_datasets/bwamem/tiny_t_L001.sorted.bam -O tiny_t_L001.dedup.sorted.bam -M tiny_t_L001.markdup.metrics.txt
I think you're just out of memory here. If you haven't specified how much memory the job is allowed to use with the memory directive, your job is likely being submitted with the default value set by your job scheduler. This is often a low number (e.g. 1 GB), which is less than what the JRE requires to run GATK. For a coordinate-sorted BAM, the --ASSUME_SORT_ORDER option might also be helpful. For example:
process gatk_markduplicates {
    tag { sample_id }
    publishDir params.gatk_mark_duplicates, mode: "copy"
    module 'gatk/4.1.3.0'
    memory 8.GB
    input:
    tuple val(sample_id), path(sorted_bam)
    output:
    tuple val(sample_id), path("${sample_id}.dedup.sorted.bam"), emit: bam
    tuple val(sample_id), path("${sample_id}.markdup.metrics.txt"), emit: metrics
    script:
    def avail_mem = task.memory ? task.memory.toGiga() : 0
    def java_options = [
        avail_mem ? "-Xmx${avail_mem}G" : "",
        "-Djava.io.tmpdir='\${PWD}/tmp'",
        "-XX:+UseSerialGC",
    ]
    """
    gatk \\
        --java-options "${java_options.join(' ')}" \\
        MarkDuplicates \\
        --INPUT "${sorted_bam}" \\
        --METRICS_FILE "${sample_id}.markdup.metrics.txt" \\
        --OUTPUT "${sample_id}.dedup.sorted.bam" \\
        --ASSUME_SORT_ORDER coordinate
    """
}
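If you would rather not hard-code resources in the process definition, the same memory request can be kept in a nextflow.config file next to main.nf. This is only a sketch: the 8 GB figure simply mirrors the value used in the process above, and the commented-out executor line is an assumption about your cluster, so replace it with whatever scheduler your site actually uses.

```groovy
// nextflow.config (sketch; values are illustrative)
process {
    // Applies only to the process named gatk_markduplicates
    withName: 'gatk_markduplicates' {
        memory = 8.GB
        // Assumption: uncomment and adjust to submit tasks to a scheduler
        // instead of running them on the login node.
        // executor = 'lsf'
    }
}
```

With this in place, task.memory in the script block resolves to the configured value, so the -Xmx setting built from it follows automatically.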
Note that your input block ensures that the sorted BAM is staged into the process working directory. Prefixing the file name with an absolute path (i.e. params.outdir_bwa_mem) bypasses the staged file and reads a different file with the same name from outside this directory, which is not what you want.
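To see the staging for yourself, a small throwaway script along these lines can help. It is a sketch rather than part of your pipeline: the file name demo_staging.nf, the process name show_staging and the parameter params.demo_bam are all hypothetical names introduced here for illustration.

```groovy
// demo_staging.nf (hypothetical file name)
// Hypothetical parameter: supply it on the command line with --demo_bam
params.demo_bam = null

process show_staging {
    debug true

    input:
    path(sorted_bam)

    script:
    """
    # Print the task working directory, then list the staged input.
    pwd
    ls -l "${sorted_bam}"
    """
}

workflow {
    // checkIfExists makes the run fail early if the path is wrong
    show_staging(Channel.fromPath(params.demo_bam, checkIfExists: true))
}
```

Run it with something like `nextflow run demo_staging.nf --demo_bam /path/to/some.sorted.bam`. With the default staging mode, the listing should show the input as a link inside the task's work directory, which is why the script only needs to refer to `${sorted_bam}` by name.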