I have two nextflow scripts that work independently, how do I combine them into one such that output files from process TRIM can be used as input files for process BWA. The process BWA must not start until paired trimmed files have been created.
Both scripts use paired files (for each mate of paired end sequencing), so not sure how to use Channel.fromFilePairs for chaining the two scripts into one.
nextflow script 1 this script trims reads and stores the output in trimmed/
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process TRIM {
debug true
publishDir 'trimmed/', mode: 'copy', overwrite: false, pattern: "*"
input:
tuple val(sampleId), file(reads)
output:
path "*"
script:
"""
trim_galore --paired -q 30 --length 30 --fastqc --phred33 $reads
"""
}
workflow {
Channel.fromFilePairs("$baseDir/subset_fq/*_{1,2}.fq.gz", checkIfExists:true) \
| TRIM
}
nextflow script 2 this script uses trimmed reads from trimmed/ and outputs sam files in bwa_aligned/
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
process BWA {
debug true
publishDir 'bwa_aligned/', mode: 'copy', overwrite: false, pattern: "*"
input:
tuple val(sampleId), file(reads)
output:
path "*"
script:
"""
bwa mem /exports/eddie/scratch/pdewari/hamour/genome/fEpiCoi_cnag1_curated_primary.no_mt.fa $reads -t 2 > ${sampleId}.sam
"""
}
workflow {
Channel.fromFilePairs("$baseDir/trimmed/*_{1,2}_val_{1,2}.fq.gz", checkIfExists:true) | BWA
}
The nextflow documentation shows lots of good examples and is very thorough, and I highly recommend checking that out whenever you get stuck and for best practices.
The workflow invocation is how you can send files from one process to another in DSL2 nextflow. But you need to remove the workflow invocations from script 1 and 2, and then create a workflow script (you can also put everything into one script, but I find it's easier to modify and add if its modular).
Script 1 becomes this:
#!/usr/bin/env nextflow
process TRIM {
debug true
tag "$sampleId"
publishDir 'trimmed/', mode: 'copy', overwrite: false
input:
tuple val(sampleId), path(reads)
output:
tuple val(sampleId), path(*_val_1.fq.gz), path(*_val_2.fq.gz), emit: trimmed_seqs
// you can add an output channel for fastqc here, as it doesn't need to be in the same channel as the reads.
script:
"""
trim_galore --paired -q 30 --length 30 --fastqc --phred33 $reads
"""
}
I've made some changes following the documentation and to make it easier to follow. Firstly, the file
qualifier is deprecated and path
is preferred. I've also added the tag
so you can see which process is running on each sample. And finally, the output is a tuple, that mimics the input structure. Also, using *
wildcard is dangerous as you're creating an output channel with each and every file in the working directory, not just the ones you want.
Script 2 becomes:
#!/usr/bin/env nextflow
process BWA {
tag "$sampleId"
debug true
publishDir 'bwa_aligned/', mode: 'copy', overwrite: false, pattern: "*"
input:
// your original input declaration didn't match the cardinality of your TRIM process output.
tuple val(sampleId), path(r1), path(r2)
output:
path "*" // I would update this to be less promiscuous
script:
"""
bwa mem /exports/eddie/scratch/pdewari/hamour/genome/fEpiCoi_cnag1_curated_primary.no_mt.fa $reads -t 2 > ${sampleId}.sam
"""
}
It's also not best practice to give nextflow absolute paths. The fEpiCoi_cnag1_curated_primary.no_mt.fa
file should be staged in each processing environment as a value channel.
And then finally, the workflow script.
#!/usr/bin/env nextflow
nextflow.enable.dsl=2
// These paths are relative to the workflow .nf file
include { TRIM } from './modules/trim.nf'
include { BWA } from './modules/bwa.nf'
Channel
.fromFilePairs("$baseDir/subset_fq/*_{1,2}.fq.gz", checkIfExists:true)
.set{ paired_reads }
workflow Mapping {
main:
TRIM(paired_reads)
BWA(TRIM.out.trimmed_seqs)
}
NB. None of these scripts have been tested.