Snakemake syntax for multiple outputs with the use of checkpoint


I'm using Snakemake to build a pipeline. I have a checkpoint that should produce multiple output files. These output files are later used in my rule all within expand. The problem is that I don't know how many files will be produced, so I can't specify the values to pass to expand.

The files are produced by an R script.

Example:

rule all:
    input:
        expand("results/{output}",
               output=????)



checkpoint rscript:
    input:
        "foo.input"
    output:
        report("somedir/{output}"),
    script:
        "../scripts/foo.R" 

Of course this is only a small part of the pipeline, but basically my R script has a loop that writes multiple files into somedir. Since I don't know how many there will be, and their names are only determined inside the R script, I can't set output in expand.

Maybe this is a really trivial question to some of you, or even a stupid question, and there are better ways to do this. If so, I'd still be thankful for help, because my English makes it hard for me to understand most of the Snakemake functions.

If there are more questions I'd gladly answer. (The best case for me would be to let output have names that I could specify at runtime within the R script)

(I also can't aggregate the created files in another rule, because each file will show a different plot)

Edit: The main problem still seems to be that checkpoint rscript is not able to create multiple {output} files in "somedir/". The attempt with touch("rscript_finish.flag") seems to write only the SVG file under the name "rscript_finish.flag", or to overwrite "rscript_finish.flag" each time the loop in my R script writes to snakemake@output[[1]].


Solution

  • There are no stupid questions :). I hope I understood, and it was actually not a trivial question at all!

    def all_input(wildcards):
        checkpoints.rscript.get()  # make sure that checkpoint rscript is executed
        filenames, = glob_wildcards("somedir/{filenames}.png")  # find all the output files of rscript (adjust the extension if needed)
        return expand("somedir_cp/{fn}.png", fn=filenames)  # keep the extension so the targets match rule add_to_report
    
    
    rule all:
        input:
            all_input
    
    
    rule add_to_report:
        input:
            "somedir/{filename}.png"
        output:
            report("somedir_cp/{filename}.png")
        shell:
            "cp {input} {output}"
    
    
    checkpoint rscript:
        input:
            "foo.input"
        output:
            touch("rscript_finish.flag")
        script:
            "../scripts/foo.R"
    

    I didn't really test the code, so I am not sure if it works immediately, but I think the logic is correct.

    This is solved with an extra rule, which I called add_to_report. All this rule does is copy an existing output file of rscript and add the copy to the report. Rule all first requests the execution of checkpoint rscript. Once the checkpoint has finished, all_input finds all the files it generated and returns a copy of each one as input for rule all. Those copies are made by rule add_to_report, and thus the files are added to the report. Note that the R script should write its plots directly into somedir/ under whatever names it chooses, not into snakemake@output[[1]]: that output is only the flag file, which touch() creates once the script has finished.
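
To make the discovery step in all_input more concrete, here is a minimal plain-Python sketch of the same idea: list whatever files exist in a directory, strip the extension, and build the target paths from the stems. It only imitates the glob step with pathlib and is not Snakemake's implementation. The names discover_targets and demo are made up for illustration, the two empty .png files stand in for what the R script would write, and unlike glob_wildcards this sketch ignores subdirectories.

```python
import tempfile
from pathlib import Path


def discover_targets(src_dir, dst_dir, ext=".png"):
    """List the files with extension `ext` in src_dir and map each to a path in dst_dir.

    Hypothetical helper: roughly what glob_wildcards + expand do together
    in all_input, restricted to files directly inside src_dir.
    """
    # Collect the part of each filename before the extension (the "wildcard value").
    stems = sorted(p.stem for p in Path(src_dir).glob(f"*{ext}"))
    # Rebuild one target per stem, keeping the extension on the target.
    return [f"{dst_dir}/{stem}{ext}" for stem in stems]


def demo():
    # Stand-in for the R script: write a number of plot files that is
    # not known to the caller in advance.
    with tempfile.TemporaryDirectory() as tmp:
        somedir = Path(tmp) / "somedir"
        somedir.mkdir()
        for name in ("plot_a", "plot_b"):
            (somedir / f"{name}.png").touch()
        return discover_targets(somedir, "somedir_cp")
```

Calling demo() returns one somedir_cp/ target per file found in somedir/. The key point is the order of operations: the listing only makes sense after the files have been written, which is why all_input calls checkpoints.rscript.get() before globbing.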