Tags: containers, workflow, pipeline, bioinformatics, nextflow

Nextflow rename barcodes and concatenate reads within barcodes


My current working directory has the following sub-directories

[Image: listing of the sub-directories; not reproduced]

My Bash script

[Image: the Bash script; not reproduced]

Hi there

I wrote the Bash script above to do the following:

  • rename the sub-directories (barcode01-12) using information from metadata.csv
  • concatenate the individual reads within each sub-directory and move the result up into $PWD
  • then I use these concatenated reads (one file per barcode) as input to my Nextflow script below:
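
The script itself was posted as an image and is not reproduced here. Purely as an illustration of the first two tasks, and not the asker's actual script, a minimal Bash sketch might look like the following. It assumes a headerless two-column metadata.csv (barcode directory, sample name) and uncompressed .fastq files; the function name preprocess_barcodes is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: rename each barcode directory after its sample and
# concatenate that directory's reads into one file in the current directory.
# Assumed metadata layout (no header): <barcode directory>,<sample name>
preprocess_barcodes() {
    local metadata="$1"
    local barcode="" sample=""

    # The '|| [[ -n ... ]]' part also handles a last line without a newline.
    while IFS=, read -r barcode sample || [[ -n "$barcode" ]]; do
        # Drop a trailing carriage return in case the CSV was saved on Windows.
        sample="${sample%$'\r'}"

        # Skip blank lines and rows whose directory does not exist.
        [[ -n "$barcode" && -n "$sample" && -d "$barcode" ]] || continue

        # Rename the barcode directory after the sample.
        mv -- "$barcode" "$sample"

        # Concatenate the reads into a single file one level up.
        cat -- "$sample"/*.fastq > "${sample}.fastq"
    done < "$metadata"
}
```

Calling preprocess_barcodes metadata.csv from the directory that holds the barcode folders would leave one <sample>.fastq file per barcode next to the renamed directories.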

Query:

How can I run the pre-processing steps above (renaming and concatenating), or the Bash script itself, at the start of the Nextflow script below?

[Image: the Nextflow script; not reproduced]


Solution

  • In my experience, FASTQ files can get quite large. Without knowing more of the specifics, I would move the concatenation (and renaming) into a separate process, so that all of the 'work' is done inside Nextflow's working directory. Here's a solution that uses DSL 2. It uses the splitCsv operator to parse the metadata, then lists the FASTQ files in each directory. Each sample name and its list of files is passed to the 'concat_reads' process, which names its output after the sample, so no separate renaming step is needed. To handle optionally gzipped files, you could try the following:

    params.metadata = './metadata.csv'
    params.outdir = './results'
    
    process concat_reads {
    
        tag { sample_name }
    
        publishDir "${params.outdir}/concat_reads", mode: 'copy'
    
        input:
        tuple val(sample_name), path(fastq_files)
    
        output:
        tuple val(sample_name), path("${sample_name}.${extn}")
    
        script:
        // 'extn' is deliberately assigned without 'def' so the output block above can use it.
        if( fastq_files.every { it.name.endsWith('.fastq.gz') } )
            extn = 'fastq.gz'
        else if( fastq_files.every { it.name.endsWith('.fastq') } )
            extn = 'fastq'
        else
            error "Concatenation of mixed filetypes is unsupported"
    
        """
        cat ${fastq_files} > "${sample_name}.${extn}"
        """
    }
    
    process pomoxis {
    
        tag { sample_name }
    
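        // Note: publishDir only copies files declared in an 'output:' block, and this
        // process declares none, so add one for the files written under 'results'.
        publishDir "${params.outdir}/pomoxis", mode: 'copy'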
    
        cpus 18
    
        input:
        tuple val(sample_name), path(fastq)
    
        """
        mini_assemble \\
            -t ${task.cpus} \\
            -i "${fastq}" \\
            -o results \\
            -p "${sample_name}"
        """
    }
    
    workflow {
    
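        // File extensions that count as FASTQ input.
        def fastq_extns = [ '.fastq', '.fastq.gz' ]
    
        // metadata.csv is read without a header: each row is <directory>,<sample name>.
        Channel.fromPath( params.metadata )
            | splitCsv()
            | map { dir, sample_name ->
    
                // 'def' keeps these variables local to the closure.
                def all_files = file(dir).listFiles()
    
                def fastq_files = all_files.findAll { fn ->
                    fastq_extns.find { fn.name.endsWith( it ) }
                }
    
                tuple( sample_name, fastq_files )
            }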
            | concat_reads
            | pomoxis
    }
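
The 'cat' in concat_reads works for gzipped input because the gzip format allows several compressed members to be joined end to end, and decompression returns their contents in order. Mixing plain and gzipped files would not produce a valid file of either kind, which is why the process stops with an error in that case. The following shell sketch checks the gzipped case on two tiny made-up FASTQ records; the file names and the temporary directory are illustrative only.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing in $PWD is touched.
workdir="$(mktemp -d)"

# Two tiny gzipped FASTQ files standing in for one barcode's reads (made-up data).
printf '@read1\nACGT\n+\nIIII\n' | gzip -c > "$workdir/part1.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip -c > "$workdir/part2.fastq.gz"

# Same command shape as the script block in concat_reads.
cat "$workdir/part1.fastq.gz" "$workdir/part2.fastq.gz" > "$workdir/sampleA.fastq.gz"

# gzip -t verifies the combined file; it exits non-zero if the file is corrupt.
gzip -t "$workdir/sampleA.fastq.gz"

# Decompressing gives both records back to back (4 lines per record).
line_count="$(gzip -dc "$workdir/sampleA.fastq.gz" | wc -l | tr -d ' ')"
echo "lines after concatenation: ${line_count}"
```

To try the workflow itself, assuming it is saved as main.nf (a hypothetical file name), something like 'nextflow run main.nf --metadata ./metadata.csv --outdir ./results' should work, since both parameters are declared at the top of the script.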