Original file names are lost when a process handles all outputs together

def barcodes = (1..2).collect { String.format("barcode%02d", it) }
params.orifq = barcodes.collect { "fastq_pass/$it/*.fastq.gz" }

Channel
.fromPath(params.orifq)
.map { it -> [it.name.split("_")[2], it] }
.groupTuple()
.set{orifq_ch}

process cat {
debug true

publishDir = [ path: "Run/orifq", mode: 'copy' ]

input:
tuple val(bc), path(fq)

output:
path("*.fastq.gz")

"""
cat ${fq} > ${bc}.fastq.gz
"""
}

process all_stats {
debug true

publishDir = [ path: "Run/stats", mode: 'copy' ]

input:
path ("*.fastq.gz")

output:
path ("all_stats.txt"), emit: all_stats

"""
seqkit stat *.fastq.gz > all_stats.txt
"""

}
workflow {
cat(orifq_ch)|collect|all_stats|view
}

In this code, the process cat generates barcode01.fastq.gz and barcode02.fastq.gz, and all outputs from cat are then processed together in all_stats.

However, the file column of all_stats.txt shows 1.fastq.gz instead of barcode01.fastq.gz. The number 1 appears to be the order in which the file arrived (FIFO), not the barcode number.

How can I fix the code so that the barcode name is preserved?

Solution

Nextflow rewrites input file names when a name pattern is used to declare a collection of files. Here the pattern is "*.fastq.gz", and the * wildcard controls the names of the staged files, which is why they arrive as 1.fastq.gz, 2.fastq.gz and so on. A fixed file name is handled in a similar way; from the Nextflow documentation on multiple input files:

When the input has a fixed file name and a collection of files is received by the process, the file name will be appended with a numerical suffix representing its ordinal position in the list.
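To see the renaming in isolation, a throwaway process can simply list the files it was given. The following is an untested sketch: the process name show_staged_names is hypothetical, and the Run/orifq path assumes the per-barcode files from the question have already been published there.

```groovy
// Hypothetical demo process: prints the names under which inputs are staged.
process show_staged_names {
    debug true   // echo the script's stdout to the console

    input:
    // Named pattern: each file in the collection is renamed on staging,
    // with * replaced by the file's ordinal position in the list.
    path "*.fastq.gz"

    script:
    """
    ls -1 *.fastq.gz
    """
}

workflow {
    // Assumed location of barcode01.fastq.gz, barcode02.fastq.gz, ...
    Channel.fromPath( "Run/orifq/*.fastq.gz" ) | collect | show_staged_names
}
```

With this input declaration the listing should show ordinal names such as 1.fastq.gz, as in the question, rather than the barcode names.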

However, the rewriting of input file names is completely optional. Instead, you can just use a regular variable to bind the collection of files. This can then be used accordingly in your process script, for example (untested):

params.reads = './fastq_pass/barcode{0[8-9],[1-5][0-9],6[0-4]}/*.fastq.gz'
params.outdir = './results'


process cat {

    publishDir "${params.outdir}/orifq", mode: 'copy'

    input:
    tuple val(bc), path(fq)

    output:
    path "${bc}.fastq.gz"

    """
    cat ${fq} > "${bc}.fastq.gz"
    """
}

process all_stats {

    publishDir "${params.outdir}/stats", mode: 'copy'

    input:
    path fastq_files

    output:
    path "all_stats.txt"

    """
    seqkit stat ${fastq_files} > all_stats.txt
    """
}

workflow {

    Channel.fromPath( params.reads )
        .map { it -> [it.name.split("_")[2], it] }
        .groupTuple()
        .set { orifq_ch }

    ...
}

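The reads pattern above matches barcodes 08 to 64 inclusive. A glob cannot express a numeric range directly, so the range is split into three sub-patterns (08 to 09, 10 to 59 and 60 to 64), listed as comma-separated alternatives inside one pair of curly braces.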
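One way to check which barcode directories a pattern actually picks up is to print the distinct parent directory names before running the real processes. This is an untested sketch that assumes the fastq_pass/barcodeNN/ layout from the question:

```groovy
params.reads = './fastq_pass/barcode{0[8-9],[1-5][0-9],6[0-4]}/*.fastq.gz'

workflow {
    Channel.fromPath( params.reads )
        .map { fq -> fq.parent.name }   // name of the barcode directory
        .unique()                       // one entry per directory
        .toSortedList()                 // gather into a single sorted list
        .view { dirs -> "Matched ${dirs.size()} barcode directories: ${dirs.join(', ')}" }
}
```

If a directory outside the intended range appears in the list, or an expected one is missing, the sub-patterns inside the braces need adjusting.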