In a pipeline that I use for several projects, I have a rule that takes a file following the pattern tei/xxx_xx_xxxxx_xxxxx.xml as input. Depending on the project, there are two possible outputs: either a single file called xhtml/xxx_xx_xxxxx_xxxxx.html, or many files following the pattern xhtml/xxx_xx_xxxxx_xxxxx_sec_n (where n is a counter for the different files).
The problem is that it cannot be predicted up front whether a project is a case 1 or a case 2 project; that is decided in the script that runs as the action of the rule. So I know neither how to define the input of the default rule that requests the file(s), nor how to define the output of the rule that creates them.
I think this is probably a case for using checkpoint(), but from the examples I found I could not see how.
This is a simplified/reduced version of the scenario:
rule all:
    input: # How to define the input when it is not clear whether it is the case 1 file or the case 2 files

rule xhtml_manuscript:
    input:
        tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.html'
    output:
        xhtml_manuscript = # How to define the output when it is not clear whether it is the case 1 file or the case 2 files
    run:
        shell(f'java -jar {SAXON} -o:xxx_xx_xxxxx_xxxxx.html {{input}} {TRANSFORMDIR}/other/opt_split_html_sections.xsl')
Possible output:
xxx_xx_xxxxx_xxxxx.html
or
xxx_xx_xxxxx_xxxxx_sec_1.html
xxx_xx_xxxxx_xxxxx_sec_2.html
xxx_xx_xxxxx_xxxxx_sec_3.html
xxx_xx_xxxxx_xxxxx_sec_4.html
xxx_xx_xxxxx_xxxxx_sec_5.html
...
This is just Sultan's answer made more explicit. The OP asks in a comment:
> the rule still creates the html file(s) but I do not mention them in the output explicitly, in favour of the tmp file
Yes, that's the idea. In fact, I would call the tmp file a "flag" file, and I wouldn't mark it as temporary. E.g.:
rule all:
    input:
        'tei/xxx_xx_xxxxx_xxxxx.done',

rule xhtml_manuscript:
    input:
        tei_manuscript = 'tei/xxx_xx_xxxxx_xxxxx.html'
    output:
        # Note the touch function
        xhtml_manuscript = touch('tei/xxx_xx_xxxxx_xxxxx.done'),
    run:
        shell(f'java -jar {SAXON} -o:xxx_xx_xxxxx_xxxxx.html {{input}} {TRANSFORMDIR}/other/opt_split_html_sections.xsl')
> it [the flag file] would probably make the xhtml_manuscript succeed
Not really. Snakemake will touch the flag file tei/xxx_xx_xxxxx_xxxxx.done only if the run or shell directive succeeds, so if the flag file is present you can be sure the underlying rule exited with exit code 0. Besides, you don't need to use the touch function: you could instead explicitly check that some files have been created. You could do:
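    shell:
        """
        # Remove stale results so the check below only sees fresh files
        rm -rf <expected output html files>
        java -jar <create html file(s)>
        # Create the flag only if one of the expected files exists
        if [ -e <this html file> ] || [ -e <that html file> ]; then
            touch {output.xhtml_manuscript}
        else
            exit 1
        fi
        """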
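The same existence check can also be done in Python inside a run: block instead of in shell. Below is a minimal sketch, not something from the original pipeline: the helper name find_html_outputs is hypothetical, and it assumes the files land in a single directory and follow the naming from the question (either <stem>.html or <stem>_sec_<n>.html).

```python
from pathlib import Path
import re


def find_html_outputs(outdir, stem):
    """Return the HTML files produced for `stem`, or raise if there are none.

    Hypothetical helper. Assumed naming, taken from the question:
      case 1: <stem>.html
      case 2: <stem>_sec_1.html, <stem>_sec_2.html, ...
    """
    outdir = Path(outdir)
    single = outdir / f"{stem}.html"
    if single.is_file():
        return [single]

    # Collect the section files and sort them by their numeric counter,
    # so that _sec_10 comes after _sec_9 rather than after _sec_1.
    pattern = re.compile(rf"^{re.escape(stem)}_sec_(\d+)\.html$")
    sections = []
    for path in outdir.iterdir():
        match = pattern.match(path.name)
        if match:
            sections.append((int(match.group(1)), path))
    if not sections:
        raise FileNotFoundError(f"no HTML output found for {stem!r} in {outdir}")
    return [path for _, path in sorted(sections)]
```

Calling it after the shell(...) line in the run: block, for example find_html_outputs('xhtml', 'xxx_xx_xxxxx_xxxxx'), should make the job fail with an exception when nothing was created, in which case the flag file is not produced.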
> Is that not a bit dirty and intransparent?
I don't know... I got used to this way of handling such cases and it looks ok to me. Ultimately though, I would say the "dirt" may be more with the structure of the pipeline or the program causing the ambiguous output. I think snakemake is doing the right thing in making such cases somewhat clunky.
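If a downstream rule needs to know which case actually occurred, one option is to write the names of the created files into the flag file rather than leaving it empty. This is only a sketch under assumptions: write_manifest and read_manifest are hypothetical helpers, and the flag would be declared as a plain output (without touch()) and written from the run: block.

```python
from pathlib import Path


def write_manifest(flag_path, produced):
    """Write one produced file path per line into the flag file (hypothetical helper)."""
    if not produced:
        # An empty manifest would hide a failed transformation.
        raise ValueError("refusing to write an empty manifest")
    flag_path = Path(flag_path)
    flag_path.parent.mkdir(parents=True, exist_ok=True)
    flag_path.write_text("".join(f"{p}\n" for p in produced))


def read_manifest(flag_path):
    """Return the file paths recorded by write_manifest, in the order written."""
    return [line for line in Path(flag_path).read_text().splitlines() if line]
```

A consumer can then call read_manifest on the .done file to find out whether it got one HTML file or several sections, without the workflow having to declare those files as outputs.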