I would like to use Snakemake to first merge some files and then process other files based on that merge. (Less abstractly: I want to combine the control IgG BAM files of two different sets and then use the merged controls to perform peak calling on other files.)
In a minimal example, the folder structure would look like this:
├── data
│   ├── toBeMerged
│   │   ├── singleA
│   │   ├── singleB
│   │   ├── singleC
│   │   └── singleD
│   └── toBeProcessed
│       ├── NotProcess1
│       ├── NotProcess2
│       ├── NotProcess3
│       ├── NotProcess4
│       └── NotProcess5
├── merge.cfg
├── output
│   ├── mergeAB_merge
│   ├── mergeCD_merge
│   ├── NotProcess1_processed
│   ├── NotProcess2_processed
│   ├── NotProcess3_processed
│   ├── NotProcess4_processed
│   └── NotProcess5_processed
├── process.cfg
└── Snakefile
Which files are merged and which are processed is defined in two config files: merge.cfg
singlePath               controlName
data/toBeMerged/singleA  output/controlAB
data/toBeMerged/singleB  output/controlAB
data/toBeMerged/singleC  output/controlCD
data/toBeMerged/singleD  output/controlCD
and process.cfg
controlName       inName
output/controlAB  data/toBeProcessed/NotProcess1
output/controlAB  data/toBeProcessed/NotProcess2
output/controlCD  data/toBeProcessed/NotProcess3
output/controlCD  data/toBeProcessed/NotProcess4
output/controlAB  data/toBeProcessed/NotProcess5
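Conceptually, the two tables define two lookups, which is what the input functions below rely on. As a sketch of how I think of them (the names merge_map and proc_map are only illustrative):

import pandas as pd

# merge.cfg: controlName -> list of single files that should be merged into it
cfgMerge = pd.read_table("merge.cfg")
merge_map = cfgMerge.groupby("controlName")["singlePath"].apply(list).to_dict()
# e.g. {'output/controlAB': ['data/toBeMerged/singleA', 'data/toBeMerged/singleB'], ...}

# process.cfg: inName -> controlName that should be used when processing it
cfgProc = pd.read_table("process.cfg")
proc_map = dict(zip(cfgProc["inName"], cfgProc["controlName"]))
# e.g. {'data/toBeProcessed/NotProcess1': 'output/controlAB', ...}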
I am currently stuck with a Snakefile like this, which does not work and gives me the error that the two rules are ambiguous. And even if I got it to work, I suspect this is not the "correct" way, since the process rule should have {mergeName} as an input to build its DAG. But that does not work either, since I would then need two wildcards in one rule.
import pandas as pd

cfgMerge = pd.read_table("merge.cfg").set_index("controlName", drop=False)
cfgProc = pd.read_table("process.cfg").set_index("inName", drop=False)

rule all:
    input:
        expand('{mergeName}', mergeName=cfgMerge.controlName),
        expand('{rawName}_processed', rawName=cfgProc.inName)

rule merge:
    input:
        lambda wc: cfgMerge[cfgMerge.controlName == wc.mergeName].singlePath
    output:
        "{mergeName}"
    shell:
        "cat {input} > {output}"

rule process:
    input:
        inMerge=lambda wc: cfgProc[cfgProc.inName == wc.rawName].controlName.iloc[0],
        Name=lambda wc: cfgProc[cfgProc.inName == wc.rawName].inName.iloc[0]
    output:
        '{rawName}_processed'
    shell:
        "cat {input.inMerge} {input.Name} > {output}"
I guess the key problem is how to use the output of a rule as the input of another one when it does not depend on the same wildcard, or includes another wildcard.
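To make that more concrete, this is the general pattern I was trying to build. It is only a minimal sketch; the dictionary sample_to_group and the rules a and b are hypothetical and not part of my workflow. The downstream rule shares no wildcard with the upstream rule; instead its input function looks up which upstream output it depends on:

# hypothetical mapping: which group-level file each sample needs
sample_to_group = {"s1": "g1", "s2": "g1", "s3": "g2"}

rule a:
    output:
        "results/a/{group}.txt"
    shell:
        "touch {output}"

rule b:
    input:
        # translate this rule's own wildcard into the path produced by rule a
        lambda wc: "results/a/{}.txt".format(sample_to_group[wc.sample])
    output:
        "results/b/{sample}_processed.txt"
    shell:
        "cat {input} > {output}"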
For future reference: the problem did not turn out to be "using the output of a rule as the input of another one when it does not depend on the same wildcard, or includes another wildcard". Instead, the targets requested by rule all and the outputs of the two rules were ambiguous: the output pattern "{mergeName}" of rule merge also matches the "*_processed" files. The simple solution was to put the output of each rule in its own directory, and it worked (see below).
import pandas as pd

cfgMerge = pd.read_table("merge.cfg").set_index("controlName", drop=False)
cfgProc = pd.read_table("process.cfg").set_index("inName", drop=False)

#ruleorder: merge > process

rule all:
    input:
        expand('output/bam/{rawName}_processed', rawName=cfgProc.inName),
        expand('output/control/{controlNameSnake}', controlNameSnake=cfgMerge.controlName.unique())

rule merge:
    input:
        lambda wc: cfgMerge[cfgMerge.controlName == wc.controlNameSnake].singlePath.unique()
    output:
        'output/control/{controlNameSnake}'
    shell:
        'echo {input} > {output}'

rule process:
    input:
        in1="data/toBeProcessed/{rawName}",
        in2=lambda wc: "output/control/" + "".join(cfgProc[cfgProc.inName == wc.rawName].controlName.unique())
    output:
        'output/bam/{rawName}_processed'
    shell:
        'echo {input} > {output}'
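An alternative that should also remove the ambiguity in the first Snakefile (an untested sketch, not what I ended up using) is to constrain the merge wildcard to the control names that actually occur in merge.cfg, so that its output pattern can no longer match the *_processed targets:

# Untested sketch: with this global constraint near the top of the first
# Snakefile, "{mergeName}" only matches the known control names and no
# longer collides with the "{rawName}_processed" outputs.
wildcard_constraints:
    mergeName="|".join(cfgMerge.controlName.unique())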