Tags: python, pipeline, workflow, snakemake

Why won't Snakemake re-run the workflow if I touch an input file smaller than 100,000 bytes?


Version

Snakemake version: 8.16.0

Snakefile

rule test:
    input:
        "input.txt"
    output:
        "output.txt"
    shell:
        """
        cat {input} > {output}
        """

The size of input.txt is less than 100,000 bytes (e.g. 99,999 bytes).
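One quick way to create such a file for reproducing the issue (a minimal sketch; the content is arbitrary, only the size matters):

```python
# Create input.txt at 99,999 bytes, just under the 100,000-byte threshold
with open("input.txt", "wb") as f:
    f.write(b"a" * 99_999)
```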

Process

  1. First run snakemake -c1; everything is OK and output.txt is created.
  2. Run touch input.txt.
  3. Run snakemake -c1 again; Snakemake reports Nothing to be done (all requested files are present and up to date). (This is not the result I expected.)
  4. Add characters to input.txt so that its size equals or exceeds 100,000 bytes.
  5. Run snakemake -c1; rule test runs again, because the input file content has changed.
  6. Run touch input.txt (same as step 2).
  7. Run snakemake -c1 again; rule test runs again, because the modification time of the input file has changed.

Question

Why won't Snakemake re-run the workflow when I touch an input file smaller than 100,000 bytes? Is there a way to make Snakemake re-run whenever I touch any input file?

I have tried the above steps on two devices and got the same results.


Solution

  • The behaviour is coded in the Snakemake source here:

    https://github.com/snakemake/snakemake/blob/e8735c1477a2a82110757ba86bbd1ccbcaf327ba/snakemake/io.py#L600

    Note that the size cutoff is hard-coded and cannot be turned off. You might think that running Snakemake with the option --rerun-triggers mtime would ignore the file checksum, but it does not.

    I think there was some discussion about this within one of the many many open Snakemake bugs, but I don't have the link to hand. At the very least the behaviour should be properly documented.

    There is a workaround that may be useful for you: run Snakemake with the --drop-metadata option so that checksums are not recorded. Note that changes to the rule code will then also go untracked. This is essentially the same as deleting the '.snakemake' directory between runs, except that you won't lose conda environments, locks, etc. I've put this option in my own default profile, as the checksumming logic was causing me problems too. It may stop incomplete jobs from being detected, but I tend to use shadow rules for anything where that is a concern.
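As a usage note on the workaround: the --drop-metadata flag can be made the default via a profile, as the answer suggests. A minimal sketch, assuming Snakemake v8 profile conventions (CLI flags become config keys without the leading dashes); the profile name and path here are placeholders, not anything from the original post:

```yaml
# ~/.config/snakemake/nometa/config.yaml   (hypothetical profile name/path)
# Invoke with: snakemake --profile nometa -c1
drop-metadata: true   # do not record checksums/metadata after each run
```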
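To make the observed behaviour concrete, here is an illustrative Python sketch of the decision described above. This is not Snakemake's actual code, and the names are mine; it only models the idea that files below a hard-coded size cutoff are compared by content checksum (so a bare touch is invisible), while larger files fall back to modification-time comparison:

```python
import hashlib
import os

# Mirrors the hard-coded 100,000-byte limit discussed above
SIZE_CUTOFF = 100_000


def checksum(path):
    """Content hash of a file (illustrative; algorithm choice is mine)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def input_changed(path, recorded_mtime, recorded_checksum):
    """Would Snakemake-like logic treat `path` as updated?

    Small file: the checksum decides and mtime is ignored,
    so `touch` alone does not trigger a re-run.
    Large file: the modification time decides.
    """
    if os.path.getsize(path) < SIZE_CUTOFF:
        return checksum(path) != recorded_checksum
    return os.path.getmtime(path) > recorded_mtime
```

This reproduces steps 2-3 (small file, touch only: no change detected), step 5 (content changed: change detected), and steps 6-7 (large file, newer mtime: change detected).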