Snakemake changes wildcard, resulting in InputFunctionException


I keep getting the same errors at the same step in my pipeline. Two rules, typing and apply_qc, somehow conflict. typing uses the outputs of another rule, polish_consensus, and apply_qc uses the outputs of typing (so the order is polish_consensus > typing > apply_qc). The outputs of typing are a fasta and a CSV file. apply_qc is a quality control step that censors the data in these files when the quality is low. I keep getting the same errors with these rules:

1. The code:

rule typing:
    input:
        f"{DATA_FOLDER}/vaccines.fasta",
        rules.polish_consensus.output
    output:
        temp(f"{OUTPUT_FOLDER}/typing/{{samplename}}.csv")
    script:
        "../scripts/typing.py"

rule apply_qc:
    input:
        rules.typing.output,
        rules.polish_consensus.output,
        rules.featurecounts.output.summary
    output:
        typing=f"{OUTPUT_FOLDER}/typing/{{samplename}}_qcpass.csv",
        consensus=f"{OUTPUT_FOLDER}/consensus/{{samplename}}_qcpass.fasta"
    script:
        "../scripts/apply_qc.py"

The output of the rule polish_consensus is output/consensus/{samplename}.fasta with samplename=prrsv12.

The error:

InputFunctionException in rule typing in file /home/lisah/Pycharm/minor-HTHPC/snakemake/workflow/rules/typing.smk, line 1:
Error:
  KeyError: 'prrsv12_qcpass'
Wildcards:
  samplename=prrsv12_qcpass
Traceback:
  File "/home/lisah/Pycharm/minor-HTHPC/snakemake/workflow/rules/typing.smk", line 12, in <lambda>

The error shows that the wildcard value used is prrsv12_qcpass, but it should be prrsv12. prrsv12_qcpass is only part of the filename of an output of the apply_qc rule.

2. The second error is something I hope I have already fixed, but it shows more information than the previous error:
AmbiguousRuleException:
Rules apply_qc and typing are ambiguous for the file output/typing/prrsv12_qcpass.csv.
Consider starting rule output with a unique prefix, constrain your wildcards, or use the ruleorder directive.
Wildcards:
        apply_qc: samplename=prrsv12
        typing: samplename=prrsv12_qcpass
Expected input files:
        apply_qc: output/typing/prrsv12.csv output/consensus/prrsv12.fasta output/counts/prrsv12_summary.csv
        typing: data/prrsv/vaccines.fasta output/consensus/prrsv12_qcpass.fasta
Expected output files:
        apply_qc: output/typing/prrsv12_qcpass.csv output/consensus/prrsv12_qcpass.fasta
        typing: output/typing/prrsv12_qcpass.csv

As said before, the wildcard for typing is wrong, but the wildcard for apply_qc is correct (?????). Likewise, the expected input for typing is not output/consensus/prrsv12_qcpass.fasta but output/consensus/prrsv12.fasta.

I hoped I had fixed the AmbiguousRuleException by using the rules.<rule>.output syntax and adding a ruleorder. As for the first error, I am completely lost and have no idea why it happens. It looks like a problem with the wildcard, but I do not see how the _qcpass part gets added to it. The error also seems to occur at random: some runs work fine and others crash with this error (yes, with the same data).

EDIT:

I tried running it with --debug-dag, and the only thing that stood out is the following:

selected job readcap
    wildcards: samplename=prrsv20_qcpass
file output/fastq/prrsv20_qcpass_readcap.fastq.gz:
    Producer found, hence exceptions are ignored.

candidate job select_centroid
    wildcards: samplename=prrsv20_qcpass
candidate job featurecounts
    wildcards: samplename=prrsv20_qcpass
candidate job map2ref
    wildcards: samplename=prrsv20_qcpass
candidate job apply_qc
    wildcards: samplename=prrsv20
selected job apply_qc
    wildcards: samplename=prrsv20

The _qcpass is added to the wildcard for the rest of the pipeline, but seems to work fine for apply_qc? apply_qc is one of the last rules in the pipeline...


Solution

  • It sounds like you misunderstood what ambiguous rules mean for snakemake, why you should avoid them and why ruleorder will not solve your problem.

    First of all, here's a MWE - a minimal working example which reproduces your issue. Note that it is easier for everyone if you provide such an example and the call used to run snakemake.

    In this case, the problem can be reproduced by calling snakemake -call (that is, --cores all):

    rule polish_consensus:
        output:
            "consensus/{samplename}.fasta",
        shell:
            """
            echo polish_consensus > {output[0]}
            """
    
    
    rule typing:
        input:
            rules.polish_consensus.output,
        output:
            "typing/{samplename}.csv",
        shell:
            """
            cat {input[0]} > {output[0]}
            """
    
    
    rule apply_qc:
        input:
            rules.typing.output,
            rules.polish_consensus.output,
        output:
            typing="typing/{samplename}_qcpass.csv",
            consensus="consensus/{samplename}_qcpass.fasta",
        shell:
            """
            echo qcpass > {output[0]}
            echo qcpass > {output[1]}
            """
    
    
    rule all:
        default_target: True
        input:
            expand(rules.apply_qc.output[0], samplename="prrsv12"),
            expand(rules.apply_qc.output[1], samplename="prrsv12"),
    

    Your wildcard {samplename} will be matched by snakemake against all the output-files you request as well as files snakemake has to generate to run the workflow.

    Now requesting typing/prrsv12_qcpass.csv matches the output of rule apply_qc with samplename=prrsv12 as well as rule typing with samplename=prrsv12_qcpass. To prevent this you should constrain your wildcard rather than trying a ruleorder or using references to a rules.<name>.output.
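The ambiguity can be illustrated outside Snakemake with a few lines of plain Python. This is only a sketch: pattern_to_regex is a hypothetical helper (not part of Snakemake's API), and it assumes that an unconstrained wildcard behaves like the regex .+, which is the documented default.

```python
import re


def pattern_to_regex(pattern, constraint=".+"):
    """Hypothetical helper: turn 'typing/{samplename}.csv' into a regex.

    It mimics the idea of wildcard matching, not Snakemake's actual code.
    """
    # Splitting on {name} yields alternating literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)  # literal part of the filename
        else:
            regex += f"(?P<{part}>{constraint})"  # wildcard part
    return re.compile(regex)


# Output patterns of the two rules from the example above.
outputs = {
    "typing": "typing/{samplename}.csv",
    "apply_qc": "typing/{samplename}_qcpass.csv",
}

target = "typing/prrsv12_qcpass.csv"

# Collect every rule whose output pattern can produce the requested file.
matches = {}
for rule, pattern in outputs.items():
    m = pattern_to_regex(pattern).fullmatch(target)
    if m:
        matches[rule] = m.group("samplename")

print(matches)
# {'typing': 'prrsv12_qcpass', 'apply_qc': 'prrsv12'}
```

Both rules claim the file, each with a different value for samplename. This is the same pair of wildcard values that the AmbiguousRuleException lists.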

    By using wildcard_constraints you tell snakemake which strings a wildcard can match. In your case, your samplename is presumably never going to contain an underscore, so you can use:

    wildcard_constraints:
        samplename="[a-zA-Z0-9]+",
    

    to tell snakemake to match only lowercase and uppercase letters and the digits 0-9, but not whitespace or other symbols such as the underscore. With this constraint snakemake never considers prrsv12_qcpass as a value for the wildcard samplename; it only accepts prrsv12 as the wildcard value and treats _qcpass as an additional, fixed part of the filename.

    More on wildcard_constraints can be found in the Snakemake documentation.
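To see what the constraint changes, the matching described above can be sketched with plain Python regexes. The patterns below are hand-written approximations of the rule outputs, and the sample list is made up for illustration; neither comes from Snakemake itself.

```python
import re

CONSTRAINT = "[a-zA-Z0-9]+"  # same regex as in wildcard_constraints

# Hand-written stand-ins for the output patterns of the two rules.
typing_unconstrained = re.compile(r"typing/(?P<samplename>.+)\.csv")
typing_constrained = re.compile(rf"typing/(?P<samplename>{CONSTRAINT})\.csv")
apply_qc_constrained = re.compile(
    rf"typing/(?P<samplename>{CONSTRAINT})_qcpass\.csv"
)

target = "typing/prrsv12_qcpass.csv"

# Without the constraint, rule typing can claim the qcpass file.
before = typing_unconstrained.fullmatch(target)
print(before.group("samplename"))  # prrsv12_qcpass

# With the constraint, only apply_qc matches, because "_" is not allowed
# inside the wildcard value.
print(typing_constrained.fullmatch(target))  # None
after = apply_qc_constrained.fullmatch(target)
print(after.group("samplename"))  # prrsv12

# Optional sanity check: find sample names the constraint would reject.
samples = ["prrsv12", "prrsv20", "sample_A"]  # hypothetical sample list
rejected = [s for s in samples if not re.fullmatch(CONSTRAINT, s)]
print(rejected)  # ['sample_A']
```

The last check points at the one caveat of this constraint: if real sample names ever contain an underscore or a dash, the regex has to be widened in a way that still excludes the _qcpass suffix.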

    Putting everything together into a single Snakefile:

    wildcard_constraints:
        samplename="[a-zA-Z0-9]+",
    
    rule polish_consensus:
        output:
            "consensus/{samplename}.fasta",
        shell:
            """
            echo polish_consensus > {output[0]}
            """
    
    
    rule typing:
        input:
            rules.polish_consensus.output,
        output:
            "typing/{samplename}.csv",
        shell:
            """
            cat {input[0]} > {output[0]}
            """
    
    
    rule apply_qc:
        input:
            rules.typing.output,
            rules.polish_consensus.output,
        output:
            typing="typing/{samplename}_qcpass.csv",
            consensus="consensus/{samplename}_qcpass.fasta",
        shell:
            """
            echo qcpass > {output[0]}
            echo qcpass > {output[1]}
            """
    
    
    rule all:
        default_target: True
        input:
            expand(rules.apply_qc.output[0], samplename="prrsv12"),
            expand(rules.apply_qc.output[1], samplename="prrsv12"),