nextflow: WARN: Input tuple does not match input set cardinality declared by process - using tuple


I have multiple processes, and I pass the output of one as the input to another. I don't understand when a tuple input works and when it doesn't.

main.nf

params.outdir_fastp = "./trimmed"
params.outdir_index = "./bwa_index"
params.rawFiles = "/Users/username/Downloads/tiny/tumor/*_R{1,2}_xxx.fastq.gz"
params.outdir_bwa_mem = "./bwamem"
params.hg38genome = "/Users/username/Downloads/NM.fasta"
params.gatk_mark_duplicates = "./gatk_mark_duplicates"


include { FASTP } from './fastp_process.nf'
include { bwa_index } from './index_process.nf'
include { align_bwa_mem } from './bwamem_process.nf'
include { gatk_markduplicates } from './gatk_markduplicates_process.nf'

workflow {
    read_pairs_ch = Channel.fromFilePairs( params.rawFiles )
    FASTP(read_pairs_ch) 
    bwa_index(params.hg38genome)
    align_bwa_mem(FASTP.out.reads,bwa_index.out) 

    gatk_markduplicates(align_bwa_mem.out.sorted_bams)
}

align_bwa_mem

process align_bwa_mem {

    tag {sample_id}
    debug true

    publishDir params.outdir_bwa_mem , mode: "copy"

    input :
    tuple val(sample_id), path(reads)
    tuple val(idxbase), path("bwa_index/*")

    output: 
    path("${sample_id}.sorted.bam"), emit : sorted_bams 

    script:
    def (fq1,fq2)=reads

    rg="\'@RG\\tID:${sample_id}\\tSM:${sample_id}\\tPL:illumina\'"
    
    """

    bwa mem -M -R $rg -v 1 "bwa_index/${idxbase}" $fq1 $fq2 | samtools sort -O bam -T - > ${sample_id}.sorted.bam

    """
}

mark_picard_duplicates.nf

process gatk_markduplicates {

    tag {sample_id}
    debug true

    publishDir params.gatk_mark_duplicates , mode:"copy"

    input:
    tuple val(sample_id), path(x) // works fine with only `val(sample_id)`

    output:
    stdout

    script:
    """
    echo "$x\n"
    echo "$sample_id\n"

    """
}

The error I get is:

    WARN: Input tuple does not match input set cardinality declared by process `gatk_markduplicates` -- offending value: /Users/username/Documents/name/nextflow_scripts/pipeline/work/2a/b24c4f7fe752389dda1085dace6509/tiny_t_L007.sorted.bam
    ERROR ~ Error executing process > 'gatk_markduplicates (1)'
    
    Caused by:
      Not a valid path value type: org.codehaus.groovy.runtime.NullObject (null)

    Tip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`

     -- Check '.nextflow.log' file for details

It works fine with val(sample_id) or path(sample_id) instead of tuple val(sample_id), path(sorted.bam). I'd like to capture both the sample_id and the path of the sorted BAM file generated in the previous step.

I expected tuple to work seamlessly here and give me the sample_id together with the path of the input file. I can't work out why it fails.


Solution

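  • The problem is that align_bwa_mem emits a channel of bare paths, whereas gatk_markduplicates expects a channel of tuples. Each item is a single path, so there is nothing left to bind to path(x) once sample_id has been filled, which is consistent with the cardinality warning and the null path error. To feed the sorted_bams channel directly into gatk_markduplicates, have align_bwa_mem emit a tuple: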

    output: 
    tuple val(sample_id), path("${sample_id}.sorted.bam"), emit : sorted_bams 
    

    Alternatively, use the map operator in the workflow to build the required tuples:

    workflow {
    
        ...
    
        align_bwa_mem.out.sorted_bams \
            | map { tuple( it.getBaseName(2), it ) } \
            | gatk_markduplicates
    }
    

    The first approach should be preferred, since the sample ID then travels with the file instead of being re-derived from its name.
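
    As a rough illustration of the first approach, here is a minimal, self-contained DSL2 sketch. The process names make_bam and consume_bam, the sample names, and the touch command are hypothetical stand-ins for align_bwa_mem, gatk_markduplicates and the real tools. The point is only that the shape of the upstream output declaration matches the shape of the downstream tuple input:

    ```nextflow
    // Hypothetical stand-in for align_bwa_mem: emits (sample_id, bam) tuples.
    process make_bam {
        input:
        val(sample_id)

        output:
        tuple val(sample_id), path("${sample_id}.sorted.bam"), emit: sorted_bams

        script:
        """
        touch ${sample_id}.sorted.bam
        """
    }

    // Hypothetical stand-in for gatk_markduplicates: declares a matching tuple input.
    process consume_bam {
        debug true

        input:
        tuple val(sample_id), path(bam)

        output:
        stdout

        script:
        """
        echo "${sample_id} ${bam}"
        """
    }

    workflow {
        // Made-up sample names, used only to drive the example.
        samples_ch = Channel.of('sampleA', 'sampleB')
        make_bam(samples_ch)
        consume_bam(make_bam.out.sorted_bams)
    }
    ```

    If make_bam declared only path("${sample_id}.sorted.bam") as its output, each channel item would be a single path again and consume_bam would hit the same cardinality warning as in the question.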