nextflow

Original file names are lost when a process receives all outputs together


def barcodes = (1..2).collect { String.format("barcode%02d", it) }
params.orifq = barcodes.collect { "fastq_pass/$it/*.fastq.gz" }

Channel
.fromPath(params.orifq)
.map { it -> [it.name.split("_")[2], it] }
.groupTuple()
.set{orifq_ch}

process cat {
debug true

publishDir = [ path: "Run/orifq", mode: 'copy' ]

input:
tuple val(bc), path(fq)

output:
path("*.fastq.gz")

"""
cat ${fq} > ${bc}.fastq.gz
"""
}

process all_stats {
debug true

publishDir = [ path: "Run/stats", mode: 'copy' ]

input:
path ("*.fastq.gz")

output:
path ("all_stats.txt"), emit: all_stats

"""
seqkit stat *.fastq.gz > all_stats.txt
"""

}
workflow {
cat(orifq_ch)|collect|all_stats|view
}

In this code, the process cat generates barcode01.fastq.gz and barcode02.fastq.gz, and all outputs of cat are then processed together in all_stats.

However, the file column of all_stats.txt shows 1.fastq.gz instead of barcode01.fastq.gz. The number 1 seems to be the file's position in the input list (a FIFO serial number), not the barcode number.

How can I fix the code so that the barcode file names are reported correctly?


Solution

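  • Nextflow will rewrite input file names when a name pattern is used to declare a collection of files. In this case, the name pattern provided is "*.fastq.gz". The * wildcard controls the names of the staged files: when the process receives several files, it is replaced by each file's ordinal position in the list, which is why all_stats sees 1.fastq.gz, 2.fastq.gz and so on. Without a wildcard, the behavior is as follows (from the "Multiple input files" section of the documentation):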

    When the input has a fixed file name and a collection of files is received by the process, the file name will be appended with a numerical suffix representing its ordinal position in the list.
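
    To see the renaming in isolation, here is a minimal sketch (untested). The process name show_staged_names, the reads_*.txt pattern and the three generated input files are hypothetical and only serve as an illustration. Going by the documented staging rules, the listing should show numbered names such as reads_1.txt rather than the original barcode names:

    ```nextflow
    process show_staged_names {
        debug true

        input:
        // Several files plus a '*' in the name pattern:
        // each file is staged under a numbered name
        path 'reads_*.txt'

        script:
        """
        ls -1 reads_*.txt
        """
    }

    workflow {
        // Hypothetical inputs: three small files created on the fly,
        // then gathered into a single list with collect()
        def files_ch = Channel.of('barcode01', 'barcode02', 'barcode03')
            .collectFile { bc -> ["${bc}.txt", bc + '\n'] }
            .collect()

        show_staged_names(files_ch)
    }
    ```

    Declaring the input as a plain variable (path reads) instead of a quoted pattern should leave the original names untouched.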

    However, the rewriting of input file names is completely optional. Instead, you can just use a regular variable to bind the collection of files. This can then be used accordingly in your process script, for example (untested):

    params.reads = './fastq_pass/barcode{0[8-9],[1-5][0-9],6[0-4]}/*.fastq.gz'
    params.outdir = './results'
    
    
    process cat {
    
        publishDir "${params.outdir}/orifq", mode: 'copy'
    
        input:
        tuple val(bc), path(fq)
    
        output:
        path "${bc}.fastq.gz"
    
        """
        cat ${fq} > "${bc}.fastq.gz"
        """
    }
    
    process all_stats {
    
        publishDir "${params.outdir}/stats", mode: 'copy'
    
        input:
        path fastq_files
    
        output:
        path "all_stats.txt"
    
        """
        seqkit stat ${fastq_files} > all_stats.txt
        """
    }
    
    workflow {
    
        Channel.fromPath( params.reads )
            .map { it -> [it.name.split("_")[2], it] }
            .groupTuple()
            .set { orifq_ch }
    
        ...
    }
    

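    The reads pattern above will match barcodes 08 to 64 inclusive. A glob cannot express a numeric range directly, so the range is broken down into three sub-patterns (08 to 09, 10 to 59 and 60 to 64), which are listed as comma-separated alternatives inside one pair of curly braces.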
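
    The range covered by the brace pattern can be checked without any FASTQ files. The sketch below (untested) assumes that Nextflow's glob matching behaves like Java NIO's PathMatcher for this pattern. It tests only the directory part of params.reads against the hypothetical names barcode01 to barcode70:

    ```nextflow
    workflow {
        // Directory part of params.reads, without the leading path and file glob
        def glob = 'barcode{0[8-9],[1-5][0-9],6[0-4]}'
        def matcher = java.nio.file.FileSystems.getDefault().getPathMatcher('glob:' + glob)

        // Hypothetical candidate names: barcode01 .. barcode70
        def candidates = (1..70).collect { n -> String.format('barcode%02d', n) }
        def matched = candidates.findAll { name ->
            matcher.matches(java.nio.file.Paths.get(name))
        }

        // 08-09 (2 names) + 10-59 (50 names) + 60-64 (5 names) = 57 names
        assert matched.first() == 'barcode08'
        assert matched.last() == 'barcode64'
        assert matched.size() == 57

        println "matched ${matched.size()} names: ${matched.first()} .. ${matched.last()}"
    }
    ```

    If any assertion fails, the sub-patterns do not cover the intended range and should be adjusted before the pattern is passed to Channel.fromPath.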