I am new to Snakemake, trying to port my BASH code to work with Snakemake rules. Disclaimer out of the way, I am attempting to demultiplex a fastq file with cutadapt in a shell block, using a fasta file with headers and adapter sequences. Samplenames matching the fasta headers are loaded in and specified as a wild card in rule all:
rule all:
input:
expand([os.path.join(outDir, "reads_trim.fq"),
os.path.join(outDir, "{name}.tmp")], name=samples_list)
The reads_trim file is a standard, concatenated fastq file (from Nanopore sequencing). My problem is that this rule is run 12 times, since I have 12 samples in my "samples_list". I then get 12 output files, as I should, with correct naming, but they all contain the exact same. I wonder if output would be correct if it would run just once instead, but haven't found a way to force that behaviour.
rule demultiplex:
""" Demultiplex Nextera-styled barcodes using the compiled barcode fasta file """
input:
barcodeFile=os.path.join(outDir, "barcodes_used.fasta"),
fastqFile=os.path.join(outDir, "reads_trim.fq")
output: os.path.join(outDir, "{name}.tmp")
message: "Demultiplexing trimmed reads"
log: "logs/cutadapt/demultiplexing_{name}.log"
shell:
"""
cutadapt -e 0.20 -O 10 -m 15 -M 15 -g file:{input.barcodeFile} -o {output} {input.fastqFile}
"""
Any idea how to get the standard behaviour where one file goes in, x files come out with different output (where the adapter was found)?
I'm not familiar with cutadapt for demultiplexing but from a quick look at the docs, I think you want something along these lines (I simplified your original code):
rule all:
input:
expand("{name}.tmp", name= samples_list),
rule demultiplex:
input:
barcodeFile= "barcodes_used.fasta",
fastqFile= "reads_trim.fq"
output:
expand("{name}.tmp", name=samples_list),
shell:
r"""
cutadapt ... -g file:{input.barcodeFile} -o {{name}}.tmp {input.fastqFile}
"""
If the input reads_trim.fq
produces all the 12 output files in one shot, then
you need to list all the 12 outputs in the output
directive. I used the
expand()
function for that. Note that {{name}}.tmp
has double curly braces
to tell snakemake that it's not a wildcard but it's to be read exactly as
{name}
(as per cutadapt docs if I read them right).
By the way, are you sure that cutadapt can demultiplex Nanopore?