I call "converging" a rule that creates one output from multiple inputs:
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input: [f"{group}.txt" for group in group2samples]

def set_input(wildcards):
    return [f"{sample}.txt" for sample in group2samples[wildcards.group]]

rule converging:
    input:
        set_input
    output:
        "{group}.txt"
    shell:
        "cat {input} > {output}"
I would like to create a "diverging" rule instead of a converging one. For instance (likely invalid snakemake code):
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [
            f"{group}/{sample}.txt"
            # the outer loop (group) has to come first in the comprehension
            for group in group2samples
            for sample in group2samples[group]]

rule diverging:
    input:
        "{group}.txt"
    output:
        # Something like
        lambda wildcards: [f"{{group}}/{sample}.txt" for sample in group2samples[wildcards.group]]
        # (but I don't think output functions of wildcards are possible)
    shell:
        "my_data_extracting_script.py {input}"
One possibly valid approach I can think of would be to put the desired outputs in an archive, which would then be the actual output of the rule:
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [f"{group}.tar.bz2" for group in group2samples]

rule diverging:
    input:
        "{group}.txt"
    output:
        "{group}.tar.bz2"
    shell:
        "my_data_extracting_and_archiving_script.py {input}"
But it would be more convenient to have separate files rather than an archive.
Another similar idea would be to use directories as outputs:
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [f"{group}_dir" for group in group2samples]

rule diverging:
    input:
        "{group}.txt"
    output:
        directory("{group}_dir")
    shell:
        "my_data_extracting_script.py --outdir {output} {input}"
But I find it better to have explicit lists of files.
Besides, if I recall correctly, the documentation discourages the use of directory().
Is there a better way?
This is most easily accomplished with dynamically created rules. The only downside is that you may generate a lot of rules; if that's the case, consider leaving them unnamed.
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"]}

rule all:
    input:
        [
            f"{group}/{sample}.txt"
            # the outer loop (group) has to come first in the comprehension
            for group in group2samples
            for sample in group2samples[group]]

# one rule per group, defined when the Snakefile is parsed
for group in group2samples:
    rule:
        name: f"diverging_{group}"  # can remove to leave unnamed
        input:
            f"{group}.txt"  # notice f string here
        output:
            expand("{group}/{sample}.txt", sample=group2samples[group], group=group)
        shell:
            "my_data_extracting_script.py {input}"
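To check what each generated rule ends up with, the same expansion can be reproduced in plain Python, without Snakemake. This is only a sketch: `outputs_for` is a hypothetical helper standing in for the `expand(...)` call above, and it assumes `expand` yields one path per sample for a single fixed group.

```python
group2samples = {
    "A": ["s1", "s2"],
    "B": ["s3", "s4"],
}


def outputs_for(group):
    """Hypothetical stand-in for expand(...): the per-sample paths of one group."""
    return [f"{group}/{sample}.txt" for sample in group2samples[group]]


# One (input, outputs) pair per generated rule, keyed by the rule name.
rules = {
    f"diverging_{group}": (f"{group}.txt", outputs_for(group))
    for group in group2samples
}

# Targets for `rule all`: the outer loop (group) comes first, so that
# `group` is already bound when group2samples[group] is evaluated.
all_targets = [
    f"{group}/{sample}.txt"
    for group in group2samples
    for sample in group2samples[group]
]

for name, (src, outs) in rules.items():
    print(f"{name}: {src} -> {outs}")
```

Every path is a literal string by the time the rule is created, so no wildcard is left for Snakemake to resolve.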
You are effectively hard coding the inputs and outputs outside the normal wildcard mechanism. I think the more standard names for these rules are scatter and gather operations.
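One thing to watch for with this pattern, shown here as a plain-Python sketch rather than Snakemake code: the f-string and `expand` work because they are evaluated immediately, inside each loop iteration. If you put a lambda or input function in the loop instead, it only looks up `group` when it is called, which is after the loop has finished. The `wildcards` parameter below is just a placeholder argument for illustration.

```python
groups = ["A", "B"]

# Evaluated immediately: each string is fixed during its own iteration.
eager = [f"{group}.txt" for group in groups]

# Deferred: each lambda reads `group` only when called, after the loop ended,
# so all of them see the last value.
late = []
for group in groups:
    late.append(lambda wildcards=None: f"{group}.txt")

# Binding the current value as a default argument freezes it per iteration.
bound = []
for group in groups:
    bound.append(lambda wildcards=None, group=group: f"{group}.txt")

print(eager)                      # ['A.txt', 'B.txt']
print([f() for f in late])        # ['B.txt', 'B.txt']
print([f() for f in bound])       # ['A.txt', 'B.txt']
```

So if a generated rule ever needs a function for its input or params, passing the loop variable as a default argument (or building the function in a small factory) should keep each rule tied to its own group.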