Search code examples
pythonbioinformaticssnakemake

Snakemake optional output with expand()


I am new to Snakemake, and I'm wondering if I'm able to put optional output files in a snakemake rule while using expand().

I'm using bowtie2-build to create an indexing of my reference genome, but depending on the genome size, bowtie2 creates indexing files with different extensions: .bt2 for small genomes, and .bt21 for big genomes.

I have the following rule:

rule bowtie2_build:
    """
    """
    input:
        "reference/"+config["reference_genome"]+".fa"
    output:
                                                                   # every possible extension in expand
        expand("reference/"+config["reference_genome"]+"{suffix}", suffix=[".1.bt2", ".2.bt2", ".3.bt2", ".4.bt2", ".rev.1.bt2", ".rev.2.bt2", ".1.bt21", ".2.bt21", ".3.bt21", ".4.bt21", ".rev.1.bt21", ".rev.2.bt21"])
    params:
        output_prefix=config["reference_genome"]
    shell:
        "bowtie2-build {input} reference/{params.output_prefix}"

But now, snakemake will always look for output with all extensions, which will give an error while running, because only files with either one of the two extensions .bt2 or .bt21 are actually created depending on genome size.

I have tried to use regex like so:

output:
        "reference/"+config["reference_genome"]+"{suffix, \.(1|2|3|4|rev)\.(bt2|bt21|1|2)\.?(bt2|bt21)?}"

And this works, but I feel like there should be an easier way for it.


Solution

  • Perhaps you are making things more complicated than necessary. Bowtie2 (the aligner) takes in input the prefix of the index files and it will find the actual index files by itself. So I wouldn't list the index files as output of any rule. Just use a flag file to indicate that indexing has been completed. For example:

    rule all:
        input:
            expand('{sample}.sam', sample= ['A', 'B', 'C']),
    
    rule bowtie_build:
        input:
            fa= 'genome.fa',
        output:
            touch('index.done'), # Flag file
        params:
            index_prefix= 'genome',
        shell:
            r"""
            bowtie2-build {input.fa} {params.index_prefix}
            """
    
    rule bowtie_align:
        input:
            idx= 'index.done',
            fq1= '{sample}.R1.fastq.gz',
        output:
            sam= '{sample}.sam',
        params:
            index_prefix= 'genome'
        shell:
            r"""
            bowtie2 -x {params.index_prefix} -U {input.fq1} -S {output.sam}
            """