I am a newcomer to Nextflow and I am trying to process multiple files in a workflow. The number of these files is more than 300, so I would like to not to paste it into a command line as an option. So what I have done is I've created a file with every filename of the files I need to process, but I am not sure how to pass it into the process. This is what I've tried:
params.SRRs = "srr_ids.txt"
process tmp {
input:
file ids
output:
path "*.txt"
script:
'''
while read id; do
touch ${id}.txt;
echo ${id} > ${id}.txt;
done < $ids
'''
}
workflow {
tmp(params.SRRs)
}
The script is supposed to read in the file srr_ids.txt
, and create files that have their ids in it (just testing on a smaller task). The error log says that the id variable is unbound, but I don't understand why. What is the conventional way of passing lots of filenames to a pipeline? Should I write some other process that parses the list?
Maybe there's a typo in your question, but the error is actually that the ids
variable is unbound:
Command error:
.command.sh: line 5: ids: unbound variable
The problem is that when you use a single-quote script string, you will not be able to access Nextflow variables in your script
block. You can either define your script using a double-quote string and escape your shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
script:
"""
while read id; do
touch "\${id}.txt"
echo "\${id}" > "\${id}.txt"
done < "${ids}"
"""
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
Or, use a shell
block which uses the exclamation mark !
character as the variable placeholder for Nextflow variables. This makes it possible to use both Nextflow and shell variables in the same piece of code without having to escape each of the shell variables:
params.SRRs = "srr_ids.txt"
process tmp {
input:
path ids
output:
path "*.txt"
shell:
'''
while read id; do
touch "${id}.txt"
echo "${id}" > "${id}.txt"
done < "!{ids}"
'''
}
workflow {
SRRs = file(params.SRRs)
tmp(SRRs)
}
What is the conventional way of passing lots of filenames to a pipeline?
The conventional way, I think, is to actually supply one (or more) glob patterns to the fromPath
channel factory method. For example:
params.SRRs = "./path/to/files/SRR*.fastq.gz"
workflow {
Channel
.fromPath( params.SRRs )
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [sleepy_bernard] DSL2 - revision: 30020008a7
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910483.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1910482.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448795.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448793.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448794.fastq.gz
/home/steve/working/stackoverflow/73702711/path/to/files/SRR1448792.fastq.gz
If instead you would prefer to pass in a list of filenames, like in your example, use either the splitCsv
or the splitText
operator to get what you want. For example:
params.SRRs = "srr_ids.txt"
workflow {
Channel
.fromPath( params.SRRs )
.splitText() { it.strip() }
.view()
}
Results:
$ nextflow run main.nf
N E X T F L O W ~ version 22.04.4
Launching `main.nf` [fervent_ramanujan] DSL2 - revision: 89a1771d50
SRR1448794
SRR1448795
SRR1448792
SRR1448793
SRR1910483
SRR1910482
Should I write some other process that parses the list?
You may not need to. My feeling is that your code might benefit from using the fromSRA
factory method, but we don't really have enough details to say one way or the other. If you need to, you could just write a function that returns a channel.