Snakemake rule with one input and several outputs


I call "converging" a rule that creates one output from multiple inputs:

group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input: [f"{group}.txt" for group in group2samples]

def set_input(wildcards):
    return [f"{sample}.txt" for sample in group2samples[wildcards.group]]

rule converging:
    input:
        set_input
    output:
        "{group}.txt"
    shell:
        "cat {input} > {output}"
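
To see what the input function resolves to without running Snakemake, it can be called directly with a stand-in for the wildcards object. This is only a sketch: `SimpleNamespace` is an assumption used here to mimic attribute access on the object Snakemake would really pass in.

```python
from types import SimpleNamespace

group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"],
}

def set_input(wildcards):
    # Same input function as in the Snakefile above.
    return [f"{sample}.txt" for sample in group2samples[wildcards.group]]

# Stand-in for the wildcards object (assumption, not Snakemake's own class).
wildcards_a = SimpleNamespace(group="A")
wildcards_b = SimpleNamespace(group="B")

print(set_input(wildcards_a))  # ['s1.txt', 's2.txt']
print(set_input(wildcards_b))  # ['s3.txt', 's4.txt']
```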

I would like to create a "diverging" rule instead of a converging one. For instance (likely invalid snakemake code):

group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [
            f"{group}/{sample}.txt"
            for group in group2samples
            for sample in group2samples[group]]

rule diverging:
    input:
        "{group}.txt"
    output:
        # Something like
        lambda wildcards: [f"{{group}}/{sample}.txt" for sample in group2samples[wildcards.group]]
        # (but I don't think output functions of wildcards are possible)
    shell:
        "my_data_extracting_script.py {input}"

One possibly valid approach I can think of would be to put the desired outputs in an archive, which would then be the actual output of the rule:

group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [f"{group}.tar.bz2" for group in group2samples]

rule diverging:
    input:
        "{group}.txt"
    output:
        "{group}.tar.bz2"
    shell:
        "my_data_extracting_and_archiving_script.py {input}"

But it would be more convenient to have separate files rather than an archive.

Another similar idea would be to use directories as outputs:

group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [f"{group}_dir" for group in group2samples]

rule diverging:
    input:
        "{group}.txt"
    output:
        directory("{group}_dir")
    shell:
        "my_data_extracting_script.py --outdir {output} {input}"

But I would rather have explicit lists of files. Besides, if I recall correctly, the documentation discourages the use of directory() outputs.

Is there a better way?


Solution

  • This is most easily accomplished by generating the rules dynamically in a loop. The only downside is that you may end up with a lot of rules; if so, consider leaving them unnamed.

    group2samples = {
        "A": ["s1", "s2"],
        "B": ["s3", "s4"]}
    
    rule all:
        input:
            [
                f"{group}/{sample}.txt"
                for group in group2samples
                for sample in group2samples[group]]
    
    for group in group2samples:
        rule:
            name: f"diverging_{group}"  # can remove to leave unnamed
            input:
                f"{group}.txt"  # notice f string here
            output:
                expand("{group}/{sample}.txt", sample=group2samples[group], group=group)
            shell:
                "my_data_extracting_script.py {input}"
    

    This effectively hard-codes the inputs and outputs of each rule, bypassing the normal wildcard mechanism. I think the more standard names for "diverging" and "converging" rules are scatter and gather operations.
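
To make concrete what the loop hard-codes, here is a plain-Python sketch (no Snakemake needed) that builds the per-group output lists the same way and checks that together they cover the targets requested by `rule all`. The helper name `outputs_for` and the `rules` dictionary are hypothetical, introduced only for illustration.

```python
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"],
}

def outputs_for(group):
    """Hypothetical helper: the files declared as outputs for one group."""
    return [f"{group}/{sample}.txt" for sample in group2samples[group]]

# Targets of `rule all`: outer loop over groups, inner loop over samples.
all_targets = [
    f"{group}/{sample}.txt"
    for group in group2samples
    for sample in group2samples[group]
]

# One entry per generated rule: input file -> declared output files.
rules = {f"{group}.txt": outputs_for(group) for group in group2samples}

for rule_input, rule_outputs in rules.items():
    print(rule_input, "->", rule_outputs)

# Every requested target is produced by exactly one generated rule.
produced = [path for outputs in rules.values() for path in outputs]
assert sorted(produced) == sorted(all_targets)
```

Because each output path is a fixed string by the time the rule is defined, no wildcard has to be resolved when Snakemake matches a requested file to a rule.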