My logic is probably flawed here, and I need your help to get the best possible approach to solving this:
I have a daisy-chained workflow where I annotate a .vcf file in multiple steps, and each step has an intermediate indexing operation (note the difference in extension between input and output):
rule first:
input: "{file}.vcf.gz"
output: "{file}_first.vcf"
shell: "annotate {input} {output}"
rule first_index:
input: rules.first.output
output:
gz="{file}_first.vcf.gz"
tbi="{file}_first.vcf.gz.tbi"
shell: """
bgzip {input}
tabix -p vcf {output.gz}
"""
rule second:
input: rules.first.index.output.gz
output: "{file}_first_second.vcf"
shell: "annotate2 {input} {output}"
rule second_index:
input: rules.second.output
output:
gz="{file}_first_second.vcf.gz"
tbi="{file}_first_second.vcf.gz.tbi"
shell: """
bgzip {input}
tabix -p vcf {output.gz}
"""
... (etc, five or six iterations more like that)
Now, this approach can be quite bulky and lead to some errors due to file naming in the workflow chain, and can be quite inflexible say if I wanted to add a rule in the middle of two existing ones, requiring me to re-edit all the subsequent rules - twice, once for the annotation and the second time for indexing.
There are two moments here that I'd like to optimize. First is the naming of output files based on previous' rule input plus the current rule name. But, this is not crucial. More importantly, I'd like to clean up the indexing step, that is practically universal. Is there a way to design a rule that would reference an input within the output?
Say, take unknown input file and convert it to the same name with a different extension:
rule generic_index:
input: "{file}.vcf"
output:
gz="{input}.gz"
tbi="{input}.gz.tbi"
shell: """
bgzip {input}
tabix -p vcf {output.gz}
"""
This way I'd be able to re-use the index rule with minimal changes after each step:
use rule generic_index as generic_index_first with:
input: rules.first.output
This would eliminate the need to re-specifiy the output names (and hence duplicate information) each time I want to re-use a rule. Any other suggestion / best practice solution that you can share is most welcome!
Thanks!
Your proposed generic index rule looks about right to me, with a couple of small fixes:
rule generic_index:
input: "{file}.vcf"
output:
gz = "{file}.vcf.gz",
tbi = "{file}.vcf.gz.tbi"
shell:
"""
bgzip {input}
tabix -p vcf {output.gz}
"""
Fixes were: You can't reference {input}
in your output file names (only in the shell part); you have to say {file}.vcf
instead. Also you need a comma between the outputs, and a newline after shell:
because the command string is multi-line. note - I didn't test my above code
And this should be all you need. When you ask "how do I re-use this rule" the answer is there is nothing more to do. If you get the names right then Snakemake will chain the rules for you based on pattern matching the names. Looking at the rest of your code, all the explicit references to rules.whatever.output
look wrong. My advice would be to spend a little time running through the official Snakemake tutorial. See how all the rules in these examples link using wildcard patterns - no direct references are made to any other rules.