Search code examples
pythonsplitwildcardsnakemake

Snakemake unknown number of output files / defining wildcards based on intermediate files


to speed up a my workflow I would like to do something similar to snakemake-unknown-output-input-files-after-splitting-by-chromosome

SAMPLE = "sample_a"

rule all:
    expand("results/{sample}-final-combined.bam", sample=SAMPLE)


rule exclude_blacklist:
    input:
        region_file = "data/{sample}.bed",
        blacklist = "data/blacklists/universal_blacklist.bed"
    output:
        "results/{sample}_blacklist-excluded.bed"
    shell:
        """
        bedtools intersect -v -a {input.region_file} -b {input.blacklist} > {output}
        """

rule target_regions_by_chrom:
    input:
        region_file ="results/{sample}_blacklist-excluded.bed",
    output:
        temp("results/{sample}_{chrom}.bed")
    script:
        "scripts/split_region_by_chrom.py"

rule simulate_reads:
    input:
        "results/{sample}_{chrom}.bed"
    output:
        "results/{sample}_{chrom}.bam"
    script:
        "scripts/simulate_reads.py"

rule combine_regions_again:
    input:
        files = expand("results/{sample}_{chrom}.bam", sample=SAMPLE, chrom=get_chroms())
    output:
        "results/{sample}-final-combined.bam"
    script:
        "scripts/combine_bams.py"

The workflow should take a .bed file with unfiltered regions, exclude problematic regions from a blacklist, split the filtered .bed file by chromosome and then simulate reads for these regions (for simplicity I spare you the details), generating .bam files. Ultimately these .bam files should be combined again to an output file.

The main problem is how to solve the unknown number of chromosomes beforehand. If I take the input .bed file, I get some errors, because some files will be empty. I thought about using an input function (here get_chroms()) to recover the chromosome from the output .bed file ("results/{sample}_blacklist-excluded.bed") after excluding the blacklisted regions, but this leads to errors during the creation of the DAG because the file is not existent.

Does anybody have any suggestions how to solve these problems:

  1. split a .bed file by chromosome, while not knowing the present chromosomes
  2. defining a wildcard based on an intermediate file in the workflow

Any help would be greatly appreciated!


Solution

  • The Dynamic Files feature of Snakemake is probably what you need:

    Dynamic files can be used whenever one has a rule for which the number of output files is unknown before the rule was executed.

    You should wrap the "results/{sample}_{chrom}.bed" with dynamic() modifier in both input and output counterparts:

    rule target_regions_by_chrom:
        output:
            dynamic("results/{sample}_{chrom}.bed")
    
    rule simulate_reads:
        input:
            dynamic("results/{sample}_{chrom}.bed")