Tags: python, python-3.x, pipeline, snakemake

Snakemake - Wildcards in input files cannot be determined from output files


I am very new to snakemake and also not so fluent in python (so apologies, this might be a very basic question):

I am currently building a pipeline to analyze a set of bamfiles with atlas. These bamfiles are located in different folders and should not be moved to a common one. Therefore I decided to provide a sample list that looks like this (this is just an example; in reality the samples might be on totally different drives):

Sample     Path
Sample1    /some/path/to/my/sample/
Sample2    /some/different/path/

And load it in my config.yaml with:

sample_file: /path/to/samplelist/samplelist.txt

Now to my Snakefile:

import pandas as pd

#define configfile with paths etc.
configfile: "config.yaml"

#read-in dataframe and define Sample and Path
SAMPLES = pd.read_table(config["sample_file"])
BAMFILE = SAMPLES["Sample"]
PATH = SAMPLES["Path"]

rule all:
    input:
        expand("{path}{sample}.summary.txt", zip, path=PATH, sample=BAMFILE)

#this works like a charm as long as I use the zip function in the rules 'all' and 'summary':

rule indexBam:
    input: 
        "{path}{sample}.bam"
    output:
        "{path}{sample}.bam.bai"
    shell:
        "samtools index {input}"

#the following rule works as long as I give the specific folder for a sample instead of {path}.
rule bamdiagnostics:
    input:
        bam="{path}{sample}.bam",
        bai=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE)
    params:
        prefix="analysis/BAMDiagnostics/{sample}"   
    output:
        "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
        "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
        "analysis/BAMDiagnostics/{sample}_MQ.txt",
        "analysis/BAMDiagnostics/{sample}_readLength.txt",
        "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
    message:
        "running BamDiagnostics...{wildcards.sample}"
    shell:
        "{config[atlas]} task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"

rule summary:
    input:
        index=expand("{path}{sample}.bam.bai", zip, path=PATH, sample=BAMFILE),
        bamd=expand("analysis/BAMDiagnostics/{sample}_approximateDepth.txt", sample=BAMFILE)
    output:
        "{path}{sample}.summary.txt"
    shell:
        "echo -e '{input.index} {input.bamd}"

I get the error

WildcardError in line 28 of path/to/my/Snakefile: Wildcards in input files cannot be determined from output files: 'path'

Can anyone help me?
- I tried to solve this problem with join or by creating input functions, but I think I am just not skilled enough to see my error...
- I guess the problem is that my summary rule does not contain the tuple with the {path} for the bamdiagnostics output (since that output lives somewhere else) and therefore cannot make the connection to the input file...
- Expanding the input of the bamdiagnostics rule makes the code run, but of course feeds every sample's input into every sample's output and creates a big mess: in this case, both bamfiles are used for the creation of each output file. This is wrong, as the samples AND the outputs are to be treated independently.


Solution

  • Based on the atlas doc, it seems like what you need is to run each rule separately for each sample, the complication here being that each sample is in a separate path.

    I modified your script to work for the above case (see DAG). The variables at the beginning of the script were renamed to make better sense, the config was removed for demo purposes, and the pathlib library was used instead of os.path.join; pathlib is not necessary, but it helps me keep my sanity. The shell command was also modified to avoid the config.
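
    As a quick illustration (a sketch using one of the example paths from the question), os.path.join and pathlib build the same path string:

    import os
    from pathlib import Path

    p1 = os.path.join("/some/path/to/my/sample/", "Sample1.bam")
    p2 = str(Path("/some/path/to/my/sample/") / "Sample1.bam")
    print(p1, p2)  # both: /some/path/to/my/sample/Sample1.bam (on a POSIX system)

    The modified Snakefile: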

    import pandas as pd
    from pathlib import Path
    
    df = pd.read_csv('sample.tsv', sep='\t', index_col='Sample')
    SAMPLES = df.index
    BAM_PATH = df["Path"]
    # print(BAM_PATH['Sample1'])
    
    rule all:
        input:
            expand("{path}{sample}.summary.txt", zip, path=BAM_PATH, sample=SAMPLES)
    
    
    rule indexBam:
        input:
            str(Path("{path}") / "{sample}.bam")
        output:
            str(Path("{path}") / "{sample}.bam.bai")
        shell:
            "samtools index {input}"
    
    # per-sample input functions look up each sample's own path in BAM_PATH
    rule bamdiagnostics:
        input:
            bam=lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam"),
            bai=lambda wildcards: str(Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
        params:
            prefix="analysis/BAMDiagnostics/{sample}"
        output:
            "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
            "analysis/BAMDiagnostics/{sample}_fragmentStats.txt",
            "analysis/BAMDiagnostics/{sample}_MQ.txt",
            "analysis/BAMDiagnostics/{sample}_readLength.txt",
            "analysis/BAMDiagnostics/{sample}_BamDiagnostics.log"
        message:
            "running BamDiagnostics...{wildcards.sample}"
        shell:
            ".atlas task=BAMDiagnostics bam={input.bam} out={params.prefix} logFile={params.prefix}_BamDiagnostics.log verbose"
    
    rule summary:
        input:
            bamd = "analysis/BAMDiagnostics/{sample}_approximateDepth.txt",
            index = lambda wildcards: str( Path(BAM_PATH[wildcards.sample]) / f"{wildcards.sample}.bam.bai"),
        output:
            str( Path("{path}") / "{sample}.summary.txt")
        shell:
            "echo -e '{input.index} {input.bamd}"