Combine two outputs of different rules in Snakemake


I would like to use Snakemake to first merge some files and then process other files based on that merge. (Less abstract: I want to combine control IgG BAM files from two different sets and then use those to perform peak calling on other files.)

In a minimal example, the folder structure would look like this.

├── data
│   ├── toBeMerged
│   │   ├── singleA
│   │   ├── singleB
│   │   ├── singleC
│   │   └── singleD
│   └── toBeProcessed
│       ├── NotProcess1
│       ├── NotProcess2
│       ├── NotProcess3
│       ├── NotProcess4
│       └── NotProcess5
├── merge.cfg
├── output
│   ├── mergeAB_merge
│   ├── mergeCD_merge
│   ├── NotProcess1_processed
│   ├── NotProcess2_processed
│   ├── NotProcess3_processed
│   ├── NotProcess4_processed
│   └── NotProcess5_processed
├── process.cfg
└── Snakefile

Which files get merged and which get processed is defined in two config files, merge.cfg:

singlePath  controlName
data/toBeMerged/singleA output/controlAB
data/toBeMerged/singleB output/controlAB
data/toBeMerged/singleC output/controlCD
data/toBeMerged/singleD output/controlCD

and process.cfg

controlName inName
output/controlAB    data/toBeProcessed/NotProcess1
output/controlAB    data/toBeProcessed/NotProcess2
output/controlCD    data/toBeProcessed/NotProcess3
output/controlCD    data/toBeProcessed/NotProcess4
output/controlAB    data/toBeProcessed/NotProcess5
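
The dependency chain these two tables encode (raw file, then its merged control, then the single files behind that control) can be sketched in plain Python. This is only an illustration: it assumes the files are tab-separated, uses the standard csv module as a stand-in for the pandas lookups, and dependencies is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Inline copies of the two tables above. Assumption: the real files are
# tab-separated; csv is used here as a stand-in for pandas.
MERGE_CFG = (
    "singlePath\tcontrolName\n"
    "data/toBeMerged/singleA\toutput/controlAB\n"
    "data/toBeMerged/singleB\toutput/controlAB\n"
    "data/toBeMerged/singleC\toutput/controlCD\n"
    "data/toBeMerged/singleD\toutput/controlCD\n"
)
PROCESS_CFG = (
    "controlName\tinName\n"
    "output/controlAB\tdata/toBeProcessed/NotProcess1\n"
    "output/controlAB\tdata/toBeProcessed/NotProcess2\n"
    "output/controlCD\tdata/toBeProcessed/NotProcess3\n"
    "output/controlCD\tdata/toBeProcessed/NotProcess4\n"
    "output/controlAB\tdata/toBeProcessed/NotProcess5\n"
)


def read_rows(text):
    """Parse a tab-separated table into a list of dicts keyed by header."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Merged control -> single files that are combined into it.
singles_by_control = defaultdict(list)
for row in read_rows(MERGE_CFG):
    singles_by_control[row["controlName"]].append(row["singlePath"])

# Raw file -> merged control it is processed against.
control_by_raw = {
    row["inName"]: row["controlName"] for row in read_rows(PROCESS_CFG)
}


def dependencies(raw_name):
    """Hypothetical helper: the control for a raw file, plus that control's inputs."""
    control = control_by_raw[raw_name]
    return control, singles_by_control[control]


print(dependencies("data/toBeProcessed/NotProcess3"))
```

Each raw file maps to exactly one control, and each control to several single files, so the process step needs only one lookup key per job.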

I am currently stuck with a Snakefile like the one below, which does not work: Snakemake reports that both rules are ambiguous. Even if I got it to work, I suspect this is not the "correct" way, since the process rule should take {mergeName} as input so that Snakemake can build its DAG. But that does not work, because I would then need two wildcards in one rule.

import pandas as pd
cfgMerge = pd.read_table("merge.cfg").set_index("controlName", drop=False)
cfgProc= pd.read_table("process.cfg").set_index("inName", drop=False)


rule all:
    input:
        expand('{mergeName}', mergeName= cfgMerge.controlName),
        expand('{rawName}_processed', rawName= cfgProc.inName)

rule merge:
    input:
        lambda wc: cfgMerge[cfgMerge.controlName == wc.mergeName].singlePath
    output:
        "{mergeName}"
    shell:
        "cat {input} > {output}"

rule process:
    input:
        inMerge=lambda wc: cfgProc[cfgProc.inName == wc.rawName].controlName.iloc[0],
        Name=lambda wc: cfgProc[cfgProc.inName == wc.rawName].inName.iloc[0]
    output:
        '{rawName}_processed'
    shell:
        "cat {input.inMerge} {input.Name} > {output}"

I guess the key problem is how to use the output of one rule as the input of another when the second rule does not depend on the same wildcard, or includes an additional wildcard.


Solution

  • For future reference: the problem was not "using the output of a rule as the input for another one when it does not depend on the same wildcard or includes another wildcard". Instead, the input of rule all and the outputs of the other two rules were ambiguous. The simple solution was to put each rule's output in a different directory, after which it worked (see below).

    import pandas as pd
    cfgMerge = pd.read_table("merge.cfg").set_index("controlName", drop=False)
    cfgProc= pd.read_table("process.cfg").set_index("inName", drop=False)
    
    #ruleorder: merge > process
    
    rule all:
        input:
            expand('output/bam/{rawName}_processed', rawName= cfgProc.inName),
            expand('output/control/{controlNameSnake}', controlNameSnake= cfgMerge.controlName.unique())
    
    rule merge:
        input:
            lambda wc: cfgMerge[cfgMerge.controlName == wc.controlNameSnake].singlePath.unique()
        output:
            'output/control/{controlNameSnake}'
        shell:
            'echo {input} > {output}'
    
    
    rule process:
        input:
            in1="data/toBeProcessed/{rawName}",
            in2=lambda wc: "output/control/"+"".join(cfgProc[cfgProc.inName == wc.rawName].controlName.unique())
        output:
            'output/bam/{rawName}_processed'
        shell:
            'echo {input} > {output}'
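
Why separate directories remove the ambiguity can be illustrated outside Snakemake. The sketch below assumes that an unconstrained wildcard matches any non-empty string (regex .+), which is Snakemake's default; pattern_to_regex and matching_rules are hypothetical helpers, not Snakemake API. The targets use bare names such as NotProcess1, on the assumption that the rules themselves supply the directory prefixes.

```python
import re


def pattern_to_regex(pattern):
    """Simplified stand-in for wildcard matching: each {name} becomes '.+'."""
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            regex += re.escape(part)       # literal text between wildcards
        else:
            regex += f"(?P<{part}>.+)"     # unconstrained wildcard
    return re.compile(regex)


def matching_rules(target, rules):
    """Names of all rules whose output pattern can produce the target file."""
    return [
        name
        for name, pattern in rules.items()
        if pattern_to_regex(pattern).fullmatch(target)
    ]


# Output patterns from the original Snakefile: "{mergeName}" matches everything.
original = {"merge": "{mergeName}", "process": "{rawName}_processed"}

# Output patterns from the working Snakefile: distinct directory prefixes.
separated = {
    "merge": "output/control/{controlNameSnake}",
    "process": "output/bam/{rawName}_processed",
}

# Two candidate rules for one target is what the ambiguity error reports.
print(matching_rules("data/toBeProcessed/NotProcess1_processed", original))

# With separate directories, each target is claimed by exactly one rule.
print(matching_rules("output/bam/NotProcess1_processed", separated))
print(matching_rules("output/control/controlAB", separated))
```

An alternative to separate directories would be to restrict what each wildcard may match (Snakemake's wildcard_constraints) or to uncomment the ruleorder line, but distinct output prefixes keep the patterns unambiguous without extra directives.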