I am new to Snakemake, and I'm wondering if I'm able to put optional output files in a snakemake rule while using expand()
.
I'm using bowtie2-build
to create an indexing of my reference genome, but depending on the genome size, bowtie2 creates indexing files with different extensions: .bt2
for small genomes, and .bt21
for big genomes.
I have the following rule:
rule bowtie2_build:
"""
"""
input:
"reference/"+config["reference_genome"]+".fa"
output:
# every possible extension in expand
expand("reference/"+config["reference_genome"]+"{suffix}", suffix=[".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2", ".1.bt21", ".2.bt21", ".3.bt21", ".4.bt21", ".rev.1.bt21", ".rev.2.bt21"])
params:
output_prefix=config["reference_genome"]
shell:
"bowtie2-build {input} reference/{params.output_prefix}"
But now, snakemake will always look for output with all extensions, which will give an error while running, because only files with either one of the two extensions .bt2
or .bt21
are actually created depending on genome size.
I have tried to use regex like so:
output:
"reference/"+config["reference_genome"]+"{suffix, \.(1|2|3|4|rev)\.(bt2|bt21|1|2)\.?(bt2|bt21)?}"
And this works, but I feel like there should be an easier way for it.
Perhaps you are making things more complicated than necessary. Bowtie2 (the aligner) takes in input the prefix of the index files and it will find the actual index files by itself. So I wouldn't list the index files as output of any rule. Just use a flag file to indicate that indexing has been completed. For example:
rule all:
input:
expand('{sample}.sam', sample= ['A', 'B', 'C']),
rule bowtie_build:
input:
fa= 'genome.fa',
output:
touch('index.done'), # Flag file
params:
index_prefix= 'genome',
shell:
r"""
bowtie2-build {input.fa} {params.index_prefix}
"""
rule bowtie_align:
input:
idx= 'index.done',
fq1= '{sample}.R1.fastq.gz',
output:
sam= '{sample}.sam',
params:
index_prefix= 'genome'
shell:
r"""
bowtie2 -x {params.index_prefix} -U {input.fq1} -S {output.sam}
"""