
Snakemake rule input and output is a directory


The output of a tool I am using in one of the rules is a directory with many files. The inputs of the next rule are two files from that directory. When I try to build the DAG, I get a missing input error.

rule rule_1: #Line 62
    input:
        a="a.txt",
        b="b.txt"
    output:
        "directory_rule1"
    params:
        a = "10",
        b = "1000"
    log:
        "rule1.log"
    shell:
        "nohup python2 rule1.py --a {input.a} "
        "--b {input.b} "
        "--out {output} "
        "--a {params.a} "
        "--b {params.b} &> {log} "

rule rule2:
    input:
        a="directory_rule1/a.tsv",
        b="directory_rule1/b.tsv"
    output:
        "a.csv"
    params:
        d="500"
    log:
        "rule2.log"
    shell:
        "python3 rule2.py -a {input.a} -b {input.b} -threshold {params.d} &> {log} "

The error I get is

Building DAG of jobs...
MissingInputException in line 62 of pathtosnakefile/snakefile:
Missing input files for rule rule2:
    output: a.csv
    affected files:
        directory_rule1/a.tsv
        directory_rule1/b.tsv

I tried removing the output section from rule2 and putting the directory in the params section, and I also tried using the directory() function in the output section. I still get the same error. How can I fix this?

Thanks!!


Solution

  • In rule_1 change:

        output:
            "directory_rule1"
    

    to:

        output:
            a="directory_rule1/a.tsv",
            b="directory_rule1/b.tsv",
    

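    or something equivalent. With two named outputs, {output} in the shell command expands to both file paths, so --out needs to get the directory name from somewhere else, for example a params entry.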

    The explanation is that before running anything, snakemake checks that rules are chained by input-output links without gaps. In your code, rule2 requires a.tsv and b.tsv, but snakemake doesn't see any rule that declares those files as output. You know rule_1 will produce them, but snakemake cannot know that, so it fails. For the same reason, wrapping the output in directory() doesn't help: it declares the directory itself as the output, not the files inside it. This is a good thing: the pipeline fails immediately if there are gaps in the DAG, and you are forced to write consistent pipelines. Another thing to keep in mind is that, with the exception of the first rule (the default target when none is given on the command line), the order of rules in your snakefile doesn't matter. What matters is the input-output chaining.
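
To make the chaining idea concrete, here is a toy model in plain Python. It is not how snakemake is implemented (the real tool works backwards from the requested target and handles wildcards); the function name missing_inputs and the dictionary layout are made up for illustration. It only shows why declaring the directory as output leaves rule2's inputs unmatched, while declaring the two files closes the gap.

```python
def missing_inputs(rules, existing_files=()):
    """Return {rule_name: [inputs that no rule outputs and that are not on disk]}.

    Hypothetical helper: a simplified stand-in for the input-output
    chaining check, matching paths by exact string comparison only.
    """
    produced = {path for rule in rules.values() for path in rule["output"]}
    available = produced | set(existing_files)
    missing = {}
    for name, rule in rules.items():
        gaps = [path for path in rule["input"] if path not in available]
        if gaps:
            missing[name] = gaps
    return missing


# Files assumed to exist already before the workflow starts.
on_disk = ["a.txt", "b.txt"]

# Layout from the question: rule_1 declares only the directory.
original = {
    "rule_1": {"input": ["a.txt", "b.txt"], "output": ["directory_rule1"]},
    "rule2": {
        "input": ["directory_rule1/a.tsv", "directory_rule1/b.tsv"],
        "output": ["a.csv"],
    },
}

# Layout from the answer: rule_1 declares the two files rule2 needs.
fixed = {
    "rule_1": {
        "input": ["a.txt", "b.txt"],
        "output": ["directory_rule1/a.tsv", "directory_rule1/b.tsv"],
    },
    "rule2": original["rule2"],
}

print(missing_inputs(original, on_disk))
# {'rule2': ['directory_rule1/a.tsv', 'directory_rule1/b.tsv']}
print(missing_inputs(fixed, on_disk))
# {}
```

In the first layout the string "directory_rule1" never equals "directory_rule1/a.tsv", so nothing links the two rules; in the second, each input of rule2 matches an output of rule_1 exactly.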