How to skip intermediate rules with ancient?


I want to skip intermediate rules if their output file already exists.

For example, I want to skip the second rule if it ran before in this basic Snakefile:

rule third:
    input:
        "b",
        "c"
    output: "d"
    shell: "touch {output}"

rule second:
    input: ancient("b")
    output: "c"
    shell: "touch {output}"

rule first:
    input: "a"
    output: "b"
    shell: "touch {output}"

After touching a, this pipeline works fine.

When I touch a again, I would expect the second rule to be skipped, as its input is flagged as ancient. This is not what happens; instead, the whole pipeline is run again.

Flagging c in the third rule instead of b in the second rule as ancient also has no effect.

Update: Using the suggestion, I updated the Snakefile to skip the second rule if it ran before (and is not forced):

ruleorder: cached_c > second

rule third:
    input:
        "b",
        "c"
    output: "d"
    shell: "touch {output}"

rule second:
    input: "b"
    output:
        c = "c"
    shell: '''
    touch {output.c}
    cp {output.c} cache
    '''

rule cached_c:
    input: "cache"
    output: "c"
    shell: "cp {input} {output}"

rule first:
    input: "a"
    output: "b"
    shell: "touch {output}"

Solution

  • Why are you assuming that the second rule would be skipped? The ancient modifier does the opposite:

    For determining whether output files have to be re-created, Snakemake checks whether the file modification date (i.e. the timestamp) of any input file of the same job is newer than the timestamp of the output file. This behavior can be overridden by marking an input file as ancient. The timestamp of such files is ignored and always assumed to be older than any of the output files

    So instead of always skipping the rule you are always running it, even if nothing has changed.

    The fact that the rule has a dependency means that you expect the output of this rule to be reevaluated whenever the input changes. You may break this dependency manually by creating a separate rule that gets its input from a cache. The cached version must have no dependencies (so it is never reevaluated), and you can manually copy the files whenever you decide they are good enough to be cached.

    One alternative is to copy the files into the cache directory automatically after each successful run (or before each run if you find the artefacts): https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#onstart-onsuccess-and-onerror-handlers

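    onsuccess:
        # Copy your files here, e.g. the freshly built c into the cache
        # (file names taken from the question; adjust to your workflow)
        shell("cp c cache")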
    

    Update: The quote from the Snakemake reference that I provided above is ambiguous and can be read in two ways. In particular, it does not say what Snakemake does with the timestamp of an ancient input when another rule in the workflow re-creates that file.

    Regarding caching: I haven't used between-workflow caching myself, but that is not what I meant. Snakemake has no way of knowing whether a changed source affects the cached file unless it re-creates the file and compares the hashes. If you wish to avoid that (and you are fairly sure that the change in the source won't affect the rest of the pipeline), you can create two separate branches that produce the target. For example, in your case you need a file called c, so I advise having two separate rules for it: the first takes b as input, the other takes cached_b. You need to disambiguate these two rules with ruleorder, giving the cached version priority. How cached_b gets created is up to you: either manually copy b to cached_b, or use a script that runs onsuccess.
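
    The two-branch idea can be sketched roughly as follows. This is an untested outline rather than a verified workflow: the rule name second_from_cache is made up for illustration, cached_b is assumed to be created by hand (for example with cp b cached_b), and the fallback relies on Snakemake choosing the other rule while cached_b is missing. Newer Snakemake versions also treat changes to a job's input set or code as rerun triggers, so switching branches may cause one extra run of c.

```snakemake
# Hypothetical sketch: "second_from_cache" and "cached_b" are illustrative names.
# Prefer the cached branch whenever both rules could produce c.
ruleorder: second_from_cache > second

rule third:
    input:
        "b",
        "c"
    output: "d"
    shell: "touch {output}"

# Preferred branch: depends only on cached_b, which no rule produces,
# so touching a (and rebuilding b) should not make c stale.
rule second_from_cache:
    input: "cached_b"
    output: "c"
    shell: "touch {output}"

# Fallback branch: assumed to be picked only while cached_b does not exist.
rule second:
    input: "b"
    output: "c"
    shell: "touch {output}"

rule first:
    input: "a"
    output: "b"
    shell: "touch {output}"
```

    With this layout, third still re-runs after a is touched, because its input b is rebuilt; only the step that produces c is meant to be skipped.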