Search code examples
pythonbioinformaticssnakemake

Can snakemake avoid ambiguity when two different rule paths can generate a given output?


Initial workflow

I have a snakefile that can generate some output from paired-end data.

In this snakefile, I have a rule that "installs" the data given information stored in the config file (get_raw_data).

Then I have a rule that uses that data to generate intermediate files on which the rest of the workflow depends (run_tophat).

Here are the input and output of these rules (OPJ stands for os.path.join):

rule get_raw_data:
    output:
        OPJ(raw_data_dir, "{lib}_1.fastq.gz"),
        OPJ(raw_data_dir, "{lib}_2.fastq.gz"),

(More details on the implementation of this rule later)

rule run_tophat:
    input:
        transcriptome = OPJ(annot_dir, "dmel-all-r5.9.gff"),
        fq1 = OPJ(raw_data_dir, "{lib}_1.fastq.gz"),
        fq2 = OPJ(raw_data_dir, "{lib}_2.fastq.gz"),
    output:
        junctions = OPJ(output_dir, "{lib}", "junctions.bed"),
        bam = OPJ(output_dir, "{lib}", "accepted_hits.bam"),

And (simplifying) my main rule would be something like that:

rule all:
    input:
        expand(OPJ(output_dir, "{lib}", "junctions.bed"), lib=LIBS),

Extending the workflow to single-end data

I now have to run my workflow on data that is single-end.

I would like to avoid the final output having different name patterns depending on whether the data is single or paired end.

I can easily make variants of the above-mentioned two rules that should work with single-end data (get_raw_data_single_end and run_tophat_single_end), whose input and output are as follows:

rule get_raw_data_single_end:
    output:
        OPJ(raw_data_dir, "{lib}.fastq.gz")

rule run_tophat_single_end:
    input:
        transcriptome = OPJ(annot_dir, "dmel-all-r5.9.gff"),
        fq = OPJ(raw_data_dir, "{lib}.fastq.gz"),
    output:
        junctions = OPJ(output_dir, "{lib}", "junctions.bed"),
        bam = OPJ(output_dir, "{lib}", "accepted_hits.bam"),

How to provide snakemake with enough information to chose the correct rule path?

The config file contains the information about whether the lib wildcard is associated with single-end or paired-end data in the following manner: The library names are keys in either a lib2raw or a lib2raw_single_end dictionary (both dictionaries are read from the config file).

I'm not expecting the same library name to be a key in both dictionaries. Therefore, in a sense, it is not ambiguous whether I want the single-end or paired-end branch of the workflow to be executed.

A function lib2data (that uses these dictionaries) is used by both get_raw_data and get_raw_data_single_end to determine what shell command to run to "install" the data.

Here is a simplified version of this function (the actual one contains an extra branch to generate a command for data from a SRR identifier):

def lib2data(wildcards):
    lib = wildcards.lib
    if lib in lib2raw:
        raw = lib2raw[lib]
        link_1 = "ln -s %s %s_1.fastq.gz" % (raw.format(mate="1"), lib)
        link_2 = "ln -s %s %s_2.fastq.gz" % (raw.format(mate="2"), lib)
        return "%s\n%s\n" % (link_1, link_2)
    elif lib in lib2raw_single_end:
        raw = lib2raw_single_end[lib]
        return "ln -s %s %s.fastq.gz\n" % (raw, lib)
    else:
        raise ValueError("Procedure to get raw data for %s unknown." % lib)

Apart from their output, the two get_raw_data* rules are the same and work the following way:

params:
    shell_command = lib2data,
shell:
    """
    (
    cd {raw_data_dir}
    {params.shell_command}
    )
    """

Is snakemake able to determine the correct rule path given information that is not coded in rules input and output, but only in config files and functions?

It seems that it is not the case. Indeed, I'm trying to test my new snakefile (with the added *_single_end rules), but a KeyError occurs during the execution of the get_raw_data rule, whereas the library for which the rule is being executed is associated with single-end data.

How to achieve the desired behaviour (a two-branch workflow able to use the information in the configuration to chose the correct branch)?

Edit: The KeyError was due to an error in lib2data

After using the correct dictionary to get the data associated with the library name, I end up having the following error:

AmbiguousRuleException:
Rules run_tophat and run_tophat_single_end are ambiguous for the file tophat_junction_discovery_revision_supplement/HWT3/junctions.bed.
Expected input files:
        run_tophat: ./HWT3_1.fastq.gz ./HWT3_2.fastq.gz Annotations/dmel-all-r5.9.gff
        run_tophat_single_end: ./HWT3.fastq.gz Annotations/dmel-all-r5.9.gff

Edit 2: Adding input to the get_raw_data* rules

After reading this post on the snakemake mailing list, I tried to add some input to my rules to avoid ambiguity.

def lib2data_input(wildcards):
    lib = wildcards.lib
    if lib in lib2raw:
        raw = lib2raw[lib]
        return [raw.format(mate="1"), raw.format(mate="2")]
    elif lib in lib2raw_single_end:
        raw = lib2raw_single_end[lib]
        return [raw]
    else:
        raise ValueError("Procedure to get raw data for %s unknown." % lib)

rule get_raw_data:
    input:
        lib2data_input
# [same output, params and shell as before]
# [same modification for the single-end case]

This results in a MissingInputException. Strangely, the reportedly missing file does exist. Is the trick supposed to work? (Can't reproduce this, now this results in:)

AmbiguousRuleException:
Rules run_tophat_single_end and run_tophat are ambiguous for the file tophat_junction_discovery_revision_supplement/HTW2/junctions.bed.
Expected input files:
        run_tophat_single_end: ./HTW2.fastq.gz Annotations/dmel-all-r5.9.gff
        run_tophat: ./HTW2_1.fastq.gz ./HTW2_2.fastq.gz Annotations/dmel-all-r5.9.gff

My way of specifying an input to the "data installation" rules is apparently not enough to guide snakemake to the correct rule.


Solution

  • I don't know if it helps, but you can use a function to define the inputs of a rule. This way you can use the same rule to process either single-end or paired-end data, given that the output of the rule is the same...

    def my_inputs(wildcards):
        data_type = config["data_type"]
        if (data_type == "pe"):
            input = ...
        elif (data_type == "se"):
            input = ...
        return input
    
    rule my_rule:
        input: my_inputs
        ...