snakemake

Elegantly handle samples with insufficient data in workflow?


I've set up a Snakemake pipeline for doing some simple QC and analysis on shallow shotgun metagenomics samples coming through our lab.

Some of the tools in the pipeline will fail or error out when samples with low amounts of data are delivered as inputs, but this is sometimes not knowable from the raw input data, since intermediate filtering steps (such as adapter trimming and host genome removal) can remove varying numbers of reads.

Ideally, I would like to be able to handle these cases with some sort of check on certain input rules, which could evaluate the number of reads in an input file and choose whether or not to continue with that portion of the workflow graph. Has anyone implemented something like this successfully?

Many thanks, -jon


Solution

  • I'm not aware of a way to stop the workflow based on a computation that happens inside the workflow itself. The rules to be executed are determined from the final required outputs, and the run will fail if those outputs cannot be generated.

    One approach could be to catch the particular tool failure (a try ... except construct in a run section, or return-code handling in a shell section), generate a dummy output file for the corresponding rule, and have the downstream rules propagate the dummy by testing whether their input is such a dummy file (see the first sketch below).

    Another approach could be to pre-process the data outside of your Snakemake workflow to determine which inputs to skip, and then filter the wildcard combinations as described here: https://stackoverflow.com/a/41185568/1878788 (see the second sketch below).
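
    A minimal sketch of the first approach is below; the tool names ("assembler", "annotator"), rule names, file paths and the choice of an empty file as the dummy marker are illustrative assumptions, not part of the original pipeline:

        rule assemble:
            input:
                "filtered/{sample}.fastq.gz"
            output:
                "assembly/{sample}.contigs.fa"
            # If the (hypothetical) assembler exits non-zero, e.g. because too
            # few reads survived filtering, create an empty dummy output
            # instead of failing the whole workflow.
            shell:
                "assembler --reads {input} --out {output} || touch {output}"

        rule annotate:
            input:
                "assembly/{sample}.contigs.fa"
            output:
                "annotation/{sample}.gff"
            run:
                import os
                # Propagate the dummy: if the upstream output is empty,
                # skip the real work and emit an empty result as well.
                if os.path.getsize(input[0]) == 0:
                    open(output[0], "w").close()
                else:
                    shell("annotator {input} > {output}")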
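
    And a minimal sketch of the second approach, assuming a tab-separated read-count table ("read_counts.tsv", one "sample<TAB>reads" line per sample) has been produced before the workflow starts; the threshold and all file names are again placeholders:

        MIN_READS = 10000  # assumed minimum read count for a sample to be analysed

        def passing_samples(counts_file="read_counts.tsv"):
            """Return the samples whose pre-computed read count meets the threshold."""
            samples = []
            with open(counts_file) as fh:
                for line in fh:
                    sample, n = line.rstrip("\n").split("\t")
                    if int(n) >= MIN_READS:
                        samples.append(sample)
            return samples

        SAMPLES = passing_samples()

        # Only samples that passed the read-count check are requested here,
        # so the part of the DAG belonging to shallow samples is never built.
        rule all:
            input:
                expand("results/{sample}.report.txt", sample=SAMPLES)

    Because rule all only asks for outputs of the passing samples, Snakemake simply never schedules the downstream rules for the shallow ones.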