How can I define an output that can be either one thing or another in Snakemake?


In a pipeline that I use for different projects, I have a rule that takes a file following the pattern tei/xxx_xx_xxxxx_xxxxx.xml as input. Depending on the project, there are two possible outputs: either a single file called xhtml/xxx_xx_xxxxx_xxxxx.html or many files following the pattern xhtml/xxx_xx_xxxxx_xxxxx_sec_n.html (where n is a counter for the different files).

The problem is that it cannot be predicted in advance whether a project is a case 1 or a case 2 project; that is decided in the script that the rule runs. Thus, I know neither how to define the input of the default rule that requests those file(s) nor how to define the output of the rule that creates them.

I think this is probably a case for using a checkpoint, but the examples I found did not show me how to apply one here.

This is a simplified/reduced version of the scenario:

rule all:
    input: # How to define the input when it is not clear if it is case 1 file or case 2 files

rule xhtml_manuscript:
    input: 
        tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.xml'
    output: 
        xhtml_manuscript = # How to define the output when it is not clear whether it is the case 1 file or the case 2 files
    run: 
        shell(f'java -jar {SAXON} -o:xxx_xx_xxxxx_xxxxx.html {{input}} {TRANSFORMDIR}/other/opt_split_html_sections.xsl')

Possible output:

xxx_xx_xxxxx_xxxxx.html

or

xxx_xx_xxxxx_xxxxx_sec_1.html
xxx_xx_xxxxx_xxxxx_sec_2.html
xxx_xx_xxxxx_xxxxx_sec_3.html
xxx_xx_xxxxx_xxxxx_sec_4.html
xxx_xx_xxxxx_xxxxx_sec_5.html
...

Solution

  • This is just Sultan's answer made more explicit. OP asks in comment:

    the rule still creates the html file(s) but I do not mention them in the output explicitly, in favour of the tmp file

    Yes, that's the idea. In fact, I would call the tmp file a "flag" file, and I wouldn't mark it as temporary. E.g.:

    rule all:
        input:
            'tei/xxx_xx_xxxxx_xxxxx.done',
    
    rule xhtml_manuscript:
        input: 
            tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.xml'
        output: 
            # Note the touch function
            xhtml_manuscript = touch('tei/xxx_xx_xxxxx_xxxxx.done'),
        run: 
            shell(f'java -jar {SAXON} -o:xxx_xx_xxxxx_xxxxx.html {{input}} {TRANSFORMDIR}/other/opt_split_html_sections.xsl')
    

    it [the flag file] would probably make the xhtml_manuscript succeed

    Not really. Snakemake touches the flag file tei/xxx_xx_xxxxx_xxxxx.done only if the run or shell directive succeeds, so if the flag file is present you can be sure the underlying rule exited with exit code 0. Besides, you don't need to use the touch function; you could instead check explicitly that some files have been created. You could do:

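    shell: 
        """
        # Remove stale outputs so the check below only sees files from this run
        rm -rf <expected output html files>
    
        java -jar <create html file(s)>
    
        # Succeed only if one of the expected files exists
        if [ -e <this html file> ] || [ -e <that html file> ]; then
            touch {output.xhtml_manuscript}
        else
            exit 1
        fi
        """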
    
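The same check can also be written in Python, for example inside a run: block. This is only a minimal sketch, assuming the HTML files end up in a single directory and follow the two naming patterns from the question. The helper names find_html_outputs and write_flag_if_done are made up for illustration and are not part of Snakemake:

```python
from pathlib import Path


def find_html_outputs(outdir, stem):
    """Return the HTML files produced for `stem` (assumed naming scheme)."""
    outdir = Path(outdir)
    single = outdir / f"{stem}.html"
    if single.exists():
        # Case 1: one file for the whole manuscript
        return [single]
    # Case 2: <stem>_sec_1.html, <stem>_sec_2.html, ... (sorted by name)
    return sorted(outdir.glob(f"{stem}_sec_*.html"))


def write_flag_if_done(outdir, stem, flag):
    """Create `flag` only if at least one expected HTML file exists."""
    found = find_html_outputs(outdir, stem)
    if not found:
        # Raising makes the calling job fail, so no flag file is left behind
        raise FileNotFoundError(f"no HTML output for {stem} in {outdir}")
    Path(flag).touch()
    return found
```

In a run: block this could be called right after the shell(...) line, e.g. write_flag_if_done('xhtml', 'xxx_xx_xxxxx_xxxxx', output.xhtml_manuscript), with the flag declared as a plain output instead of touch(...). Stale HTML files from earlier runs would still have to be removed beforehand, as in the shell version.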

    Is that not a bit dirty and opaque?

    I don't know... I got used to this way of handling such cases and it looks ok to me. Ultimately though, I would say the "dirt" may be more with the structure of the pipeline or the program causing the ambiguous output. I think snakemake is doing the right thing in making such cases somewhat clunky.
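If a downstream step needs the actual file list rather than just the flag, the files that were produced can be listed after the fact. Again just a sketch under the naming assumptions from the question: list_html_outputs is a hypothetical helper, and wiring it into an input function (for instance together with a checkpoint) is not shown here:

```python
import re
from pathlib import Path


def list_html_outputs(outdir, stem):
    """List the produced HTML files: the single file, or sections ordered by n."""
    outdir = Path(outdir)
    single = outdir / f"{stem}.html"
    if single.exists():
        return [str(single)]

    # Match <stem>_sec_<n>.html and capture n for numeric sorting
    pattern = re.compile(rf"{re.escape(stem)}_sec_(\d+)\.html")
    numbered = []
    for path in outdir.iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), str(path)))

    # Sort by section number so that sec_10 comes after sec_2
    return [path for _, path in sorted(numbered)]
```

The numeric sort is the only real design choice here: a plain sort by name would put sec_10 before sec_2.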