My current working directory has the following sub-directories
My Bash script
Hi there
I have compiled the above Bash script to do the following tasks:
Query:
How can I get the above pre-processing tasks (renaming and concatenating) or the Bash script added at the beginning of my following Nextflow script?
In my experience, FASTQ files can get quite large. Without knowing too much of the specifics, my recommendation would be to move the concatenation (and renaming) to a separate process. In this way, all of the 'work' can be done inside Nextflow's working directory. Here's a solution that uses the new DSL 2. It uses the splitCsv operator to parse the metadata and identify the FASTQ files. The collection can then be passed into our 'concat_reads' process. To handle optionally gzipped files, you could try the following:
params.metadata = './metadata.csv'
params.outdir = './results'
process concat_reads {
tag { sample_name }
publishDir "${params.outdir}/concat_reads", mode: 'copy'
input:
tuple val(sample_name), path(fastq_files)
output:
tuple val(sample_name), path("${sample_name}.${extn}")
script:
if( fastq_files.every { it.name.endsWith('.fastq.gz') } )
extn = 'fastq.gz'
else if( fastq_files.every { it.name.endsWith('.fastq') } )
extn = 'fastq'
else
error "Concatentation of mixed filetypes is unsupported"
"""
cat ${fastq_files} > "${sample_name}.${extn}"
"""
}
process pomoxis {
tag { sample_name }
publishDir "${params.outdir}/pomoxis", mode: 'copy'
cpus 18
input:
tuple val(sample_name), path(fastq)
"""
mini_assemble \\
-t ${task.cpus} \\
-i "${fastq}" \\
-o results \\
-p "${sample_name}"
"""
}
workflow {
fastq_extns = [ '.fastq', '.fastq.gz' ]
Channel.fromPath( params.metadata )
| splitCsv()
| map { dir, sample_name ->
all_files = file(dir).listFiles()
fastq_files = all_files.findAll { fn ->
fastq_extns.find { fn.name.endsWith( it ) }
}
tuple( sample_name, fastq_files )
}
| concat_reads
| pomoxis
}