How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?


I would like to learn how snakemake handles the following situations, and what the best practice is to avoid collisions and corruption.

rule something:
    input:
            expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
            "/path/to/merged.txt"
    shell:
            "cat {input} >> {output}"

With snakemake -j10 the command will try to append to the same file simultaneously, and I could not figure out whether this can lead to corruption or whether it is already handled.

Also, how are more complicated cases handled, e.g. where it is not just cat but the result of another process, computed from the input, that is appended to the same file? Is the best practice to write the results to individual files first and then cat them together?

rule get_merged_total_distinct:
    input:
        expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam",dataset_id=config["dataset_id"],tomerge=list(TOMERGE.keys())),
    output:
        "{dataset_id}/merge_libraries/merged_total_distinct.csv"
    params:
        with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
    shell:
        """
        RCT=$(samtools view -@4 -c -F1 -F4 -q 30 {params.with_dups})
        RCD=$(samtools view -@4 -c -F1 -F4 -q 30 {input})
        printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
        """

or cases where an external script is being called to print the result to a single output file?

    input:
        expand("infile/{x}",...) # expanded as above
    output:
        "results/all.txt"
    shell:
        """
        bash script.sh {params.x} {input} {params.y} >> {output}
        """

Solution

  • With your example, the shell directive will expand to

    cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
    

    where SAMPLE1, etc., come from LIST. In this case there is no collision, corruption, or race condition: one job runs that command exactly as if you had typed it in your shell, and all inputs are concatenated into the output. Since snakemake is pull-based, once the output exists the rule only runs again if the inputs change, at which point >> would append the new inputs to any old content still in the file. As such, I would recommend using > so the old contents are replaced; rules should be deterministic where possible.
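To make the difference between >> and > concrete, here is a minimal Python sketch of what happens when the merged file still holds content from an earlier run. The `merge` helper, the temporary directory, and the sample names are hypothetical; mode "a" stands in for >> and mode "w" for >.

```python
import tempfile
from pathlib import Path


def merge(inputs, output, mode):
    """Concatenate the input files into output; "a" mimics >>, "w" mimics >."""
    with open(output, mode) as out:
        for path in inputs:
            out.write(Path(path).read_text())


workdir = Path(tempfile.mkdtemp())
inputs = []
for name in ["SAMPLE1", "SAMPLE2"]:  # hypothetical sample names
    path = workdir / f"out-{name}.txt"
    path.write_text(f"{name}\n")
    inputs.append(path)

appended = workdir / "merged_append.txt"
overwritten = workdir / "merged_overwrite.txt"
for _ in range(2):  # run the same merge twice without deleting the old file
    merge(inputs, appended, "a")
    merge(inputs, overwritten, "w")

appended_lines = appended.read_text().splitlines()
overwritten_lines = overwritten.read_text().splitlines()
print(appended_lines)     # content accumulates across runs
print(overwritten_lines)  # content depends only on the inputs
```

With append mode the result depends on how often the command has run against the same file; with overwrite mode it depends only on the inputs.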

    Now, if you had done something like

    rule something:
            input:
                    "/path/to/out-{asd}.txt"
            output:
                    touch("/path/to/merged-{asd}.txt")
            params:
                    output="/path/to/merged.txt"
            shell:
                    "cat {input} >> {params.output}"
    # then invoke
    snakemake -j10 /path/to/merged-{a..z}.txt
    

    Things get messier. Snakemake will run up to 10 jobs at a time, all of them writing to the single merged.txt. Note that this file is now a parameter and we are targeting some dummy files. This will behave as if you had 10 different shells and executed the commands

    cat /path/to/out-a.txt >> /path/to/merged.txt
    # ...
    cat /path/to/out-z.txt >> /path/to/merged.txt
    

    all at once. The order of the output will be unpredictable, and lines may be interleaved or cut off.

    Some guidance:

    • Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
    • Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
    • If you need to append a bunch of outputs, say log files or to create a summary, do so in a separate rule. This again follows from the first point, but it's the only reason I can think of to use append.
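As a rough sketch of the guidance above, the "one file per job, then merge in a separate step" pattern could look like this in plain Python, for example as the body of a `run:` block or a `script:` file. The function names, file names, and read counts are hypothetical and only illustrate the idea; this is not taken from the asker's workflow.

```python
import tempfile
from pathlib import Path


def write_row(out_path, sample, total, distinct):
    """Per-sample step: each job overwrites its OWN file, so jobs never share a target."""
    Path(out_path).write_text(f"{sample},{total},{distinct}\n")


def merge_rows(row_paths, merged_path):
    """Merge step: a single job, sorted inputs, overwrite mode."""
    with open(merged_path, "w") as out:
        for path in sorted(row_paths, key=str):
            out.write(Path(path).read_text())


workdir = Path(tempfile.mkdtemp())
# Hypothetical library names and read counts, for illustration only.
counts = {"libB": (120, 100), "libA": (80, 75), "libC": (50, 50)}
row_paths = []
for sample, (total, distinct) in counts.items():
    path = workdir / f"{sample}_distinct.csv"
    write_row(path, sample, total, distinct)
    row_paths.append(path)

merged = workdir / "merged_total_distinct.csv"
merge_rows(row_paths, merged)
first = merged.read_text()
merge_rows(list(reversed(row_paths)), merged)  # second run, different input order
second = merged.read_text()
print(first, end="")
```

Because the per-sample files can be produced in any order and in parallel, while the merge sorts its inputs and overwrites its target, the final file is the same on every run.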

    Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.