How can I run a subset of my snakemake rules several times with wildcards?


I have a snakemake pipeline in which the input files are divided into two groups - input I would like to pass through the entire pipeline (true input) and input that should only pass through the first few rules (control input). How can I pass the true input through all rules and the control input only through the first few?

The most obvious solution would be duplication, i.e. running all rules on the first group (true), then copy-pasting the rules that I want to run on the second group (control) and giving those copies the second group of input separately.

However, I think this isn't good practice for code maintainability, and I would much prefer a solution that uses wildcards somehow.

The code below is a simplification of the problem with fewer rules:


INPUT = ["NAME1", "NAME2", "NAME3", "CONTROL"]
LABELS = ["A", "B", "C", "D"]

rule all:
    input:
        expand("output/{input}_results.txt",
            input = INPUT)

rule split_data:
    '''
    Read the true input and control then split them
    '''
    input:
        "data/{input}.txt"
    output:
        expand("data/{{input}}/{label}.txt", label = LABELS)
    script:
        "scripts/split_data.py"

rule run_true_data:
    '''
    Read only the true split and produce results.
    '''
    input:
        expand("data/{{input}}/{label}.txt", label = LABELS)
    output:
        "output/{input}_results.txt"
    script:
        "scripts/produce_results.py"

In the ideal version of the above, the input wildcard should take the values [NAME1, NAME2, NAME3, CONTROL] for split_data only, whilst run_true_data and all should receive only [NAME1, NAME2, NAME3].

In addition, the labels should be generated depending on the wildcard (with a lambda, for example), but this is not important for now, so I left it out to avoid confusing things.


Solution

  • It would help to have more detail on the exact nature of the problem. In case you just need a different set of inputs for your second rule, why not add another wildcard for that step, which limits the input to only the required entries? Something along the lines of the script below:

    INPUT = ["NAME1", "NAME2", "NAME3", "CONTROL"]
    TRUE_INPUT = ["NAME1", "NAME2", "NAME3"]
    LABELS = ["A", "B", "C", "D"]
    
    rule all:
        input:
            expand("data/{input}/{label}.txt",
                input = INPUT, label = LABELS),
            expand("output/{true_input}_results.txt", true_input = TRUE_INPUT)
    
    rule split_data:
        '''
        Read the true input and control then split them
        '''
        input:
            "data/{input}.txt"
        output:
            "data/{input}/{label}.txt"
        script:
            "scripts/split_data.py"
    
    rule run_true_data:
        '''
        Read only the true split and produce results.
        '''
        input:
            lambda wildcards: ["data/{}/{}.txt".format(wildcards.true_input, label) for label in LABELS]
        output:
            "output/{true_input}_results.txt"
        script:
            "scripts/produce_results.py"
    

    In this way you should be able to control both the labels and the input for the rule run_true_data.
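
    To see which files the Snakefile above ends up requesting, here is a rough plain-Python sketch that mimics the two expand() calls in rule all and the lambda in run_true_data. It does not import snakemake; the helper names (split_targets, result_targets, run_true_data_input) and the CONTROLS set are hypothetical, added only for illustration. It also assumes TRUE_INPUT can be derived by filtering INPUT rather than kept as a second hand-written list.

```python
from itertools import product

INPUT = ["NAME1", "NAME2", "NAME3", "CONTROL"]
CONTROLS = {"CONTROL"}  # assumption: control inputs are identified by name
LABELS = ["A", "B", "C", "D"]

# Derive the true inputs instead of maintaining a second list by hand.
TRUE_INPUT = [name for name in INPUT if name not in CONTROLS]


def split_targets(inputs, labels):
    """Every input/label combination, as requested from split_data."""
    return [f"data/{name}/{label}.txt" for name, label in product(inputs, labels)]


def result_targets(true_inputs):
    """One results file per true input, as requested from run_true_data."""
    return [f"output/{name}_results.txt" for name in true_inputs]


def run_true_data_input(true_input, labels=LABELS):
    """The list the lambda in run_true_data builds for one wildcard value."""
    return [f"data/{true_input}/{label}.txt" for label in labels]


# Everything rule all asks for: split files for all inputs,
# results files for the true inputs only.
ALL_TARGETS = split_targets(INPUT, LABELS) + result_targets(TRUE_INPUT)
```

    Because rule all asks for the split files of every input but the results files of the true inputs only, CONTROL should pass through split_data and nothing downstream. In a real Snakefile the same filtering line for TRUE_INPUT could sit at the top, next to INPUT.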