Snakemake: a rule with batched inputs and corresponding outputs


My workflow has the following basic structure:

  • files are downloaded from a remote server,
  • converted locally and then
  • analyzed.

One of the analyses is time-consuming, but it scales well when run on multiple input files at a time. The output of this rule is independent of which files are analyzed together as a batch, as long as they all share the same settings. Upstream and downstream rules operate on individual files, so from the perspective of the workflow this rule is an outlier. Which files are to be run together can be determined in advance, although ideally, if some of the inputs fail to be produced along the way, the rule should be run on the reduced set of files.

The following example illustrates the problem:

samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        "created/{sample}.txt"
    output:
        "analyzed/{sample}.txt"
    params:
        outdir = "analyzed/"
    shell:
        """
        sleep 1 # or longer
        parallel md5sum {{}} \> {params.outdir}/{{/}} ::: {input}
        """

rule finalize:
    input:
        "analyzed/{sample}.txt"
    output:
        "done/{sample}.txt"
    shell:
        "touch {output}"

Rule analyze is the one that should produce multiple output files from multiple inputs, according to the assignment in groups. Rules create and finalize operate on individual files upstream and downstream, respectively.

Is there a way to implement such logic? I'd like to avoid splitting the workflow to accommodate this irregularity.

Note: this question is not related to the similar-sounding question here.


Solution

  • If I understand correctly, rule analyze takes as input the files created/a.txt, created/b.txt, created/c.txt for group A and produces as output analyzed/a.txt, analyzed/b.txt, analyzed/c.txt. The same goes for group B, so rule analyze runs twice while everything else runs six times.

    If so, I would make rule analyze output a dummy file signaling that the files in group A (or B, etc.) have been analyzed. Downstream rules take this dummy file as input and will find the corresponding analyzed/{sample}.txt available.

    Here's your example:

    samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]
    groups = {
        'A': samples[0:3],
        'B': samples[3:6]
    }
    
    # Map samples to groups by inverting dict groups
    inv_groups = {}
    for x in samples:
        for k in groups:
            if x in groups[k]:
                inv_groups[x] = k
    
    rule all:
        input:
            expand("done/{sample}.txt", sample = samples)
    
    rule create:
        output:
            "created/{sample}.txt"
        shell:
            "echo {wildcards.sample} > {output}"
    
    rule analyze:
        input:
            # Collect input for this group (A, B, etc)
            grp = lambda wc: ["created/%s.txt" % x for x in groups[wc.group]]
        output:
            done = touch('created/{group}.done'),
        shell:
            """
            # Code that actually does the job...
            # analyzed/ is not a declared output, so Snakemake will not
            # create it for us; make sure it exists before writing into it.
            mkdir -p analyzed
            for x in {input.grp}
            do
                sn=`basename $x .txt`
                touch analyzed/$sn.txt
            done
            """
    
    rule finalize:
        input:
            # Get the dummy file for this {sample}.
            # If the dummy exists, the corresponding analyzed/{sample}.txt exists as well.
            done = lambda wc: 'created/%s.done' % inv_groups[wc.sample],
        output:
            fout = "done/{sample}.txt"
        params:
            fin = "analyzed/{sample}.txt",
        shell:
            "cp {params.fin} {output.fout}"