I have a snakefile that can generate some output from paired-end data.
In this snakefile, I have a rule that "installs" the data given information stored in the config file (get_raw_data
).
Then I have a rule that uses that data to generate intermediate files on which the rest of the workflow depends (run_tophat
).
Here are the input and output of these rules (OPJ
stands for os.path.join
):
rule get_raw_data:
output:
OPJ(raw_data_dir, "{lib}_1.fastq.gz"),
OPJ(raw_data_dir, "{lib}_2.fastq.gz"),
(More details on the implementation of this rule later)
rule run_tophat:
input:
transcriptome = OPJ(annot_dir, "dmel-all-r5.9.gff"),
fq1 = OPJ(raw_data_dir, "{lib}_1.fastq.gz"),
fq2 = OPJ(raw_data_dir, "{lib}_2.fastq.gz"),
output:
junctions = OPJ(output_dir, "{lib}", "junctions.bed"),
bam = OPJ(output_dir, "{lib}", "accepted_hits.bam"),
And (simplifying) my main rule would be something like that:
rule all:
input:
expand(OPJ(output_dir, "{lib}", "junctions.bed"), lib=LIBS),
I now have to run my workflow on data that is single-end.
I would like to avoid the final output having different name patterns depending on whether the data is single or paired end.
I can easily make variants of the above-mentioned two rules that should work with single-end data (get_raw_data_single_end
and run_tophat_single_end
), whose input and output are as follows:
rule get_raw_data_single_end:
output:
OPJ(raw_data_dir, "{lib}.fastq.gz")
rule run_tophat_single_end:
input:
transcriptome = OPJ(annot_dir, "dmel-all-r5.9.gff"),
fq = OPJ(raw_data_dir, "{lib}.fastq.gz"),
output:
junctions = OPJ(output_dir, "{lib}", "junctions.bed"),
bam = OPJ(output_dir, "{lib}", "accepted_hits.bam"),
The config file contains the information about whether the lib
wildcard is associated with single-end or paired-end data in the following manner: The library names are keys in either a lib2raw
or a lib2raw_single_end
dictionary (both dictionaries are read from the config file).
I'm not expecting the same library name to be a key in both dictionaries. Therefore, in a sense, it is not ambiguous whether I want the single-end or paired-end branch of the workflow to be executed.
A function lib2data
(that uses these dictionaries) is used by both get_raw_data
and get_raw_data_single_end
to determine what shell command to run to "install" the data.
Here is a simplified version of this function (the actual one contains an extra branch to generate a command for data from a SRR identifier):
def lib2data(wildcards):
lib = wildcards.lib
if lib in lib2raw:
raw = lib2raw[lib]
link_1 = "ln -s %s %s_1.fastq.gz" % (raw.format(mate="1"), lib)
link_2 = "ln -s %s %s_2.fastq.gz" % (raw.format(mate="2"), lib)
return "%s\n%s\n" % (link_1, link_2)
elif lib in lib2raw_single_end:
raw = lib2raw_single_end[lib]
return "ln -s %s %s.fastq.gz\n" % (raw, lib)
else:
raise ValueError("Procedure to get raw data for %s unknown." % lib)
Apart from their output, the two get_raw_data*
rules are the same and work the following way:
params:
shell_command = lib2data,
shell:
"""
(
cd {raw_data_dir}
{params.shell_command}
)
"""
Is snakemake able to determine the correct rule path given information that is not coded in rules input and output, but only in config files and functions?
It seems that it is not the case. Indeed, I'm trying to test my new snakefile (with the added *_single_end
rules), but a .KeyError
occurs during the execution of the get_raw_data
rule, whereas the library for which the rule is being executed is associated with single-end data
How to achieve the desired behaviour (a two-branch workflow able to use the information in the configuration to chose the correct branch)?
KeyError
was due to an error in lib2data
After using the correct dictionary to get the data associated with the library name, I end up having the following error:
AmbiguousRuleException:
Rules run_tophat and run_tophat_single_end are ambiguous for the file tophat_junction_discovery_revision_supplement/HWT3/junctions.bed.
Expected input files:
run_tophat: ./HWT3_1.fastq.gz ./HWT3_2.fastq.gz Annotations/dmel-all-r5.9.gff
run_tophat_single_end: ./HWT3.fastq.gz Annotations/dmel-all-r5.9.gff
get_raw_data*
rulesAfter reading this post on the snakemake mailing list, I tried to add some input to my rules to avoid ambiguity.
def lib2data_input(wildcards):
lib = wildcards.lib
if lib in lib2raw:
raw = lib2raw[lib]
return [raw.format(mate="1"), raw.format(mate="2")]
elif lib in lib2raw_single_end:
raw = lib2raw_single_end[lib]
return [raw]
else:
raise ValueError("Procedure to get raw data for %s unknown." % lib)
rule get_raw_data:
input:
lib2data_input
# [same output, params and shell as before]
# [same modification for the single-end case]
This results in a (Can't reproduce this, now this results in:)MissingInputException
. Strangely, the reportedly missing file does exist. Is the trick supposed to work?
AmbiguousRuleException:
Rules run_tophat_single_end and run_tophat are ambiguous for the file tophat_junction_discovery_revision_supplement/HTW2/junctions.bed.
Expected input files:
run_tophat_single_end: ./HTW2.fastq.gz Annotations/dmel-all-r5.9.gff
run_tophat: ./HTW2_1.fastq.gz ./HTW2_2.fastq.gz Annotations/dmel-all-r5.9.gff
My way of specifying an input to the "data installation" rules is apparently not enough to guide snakemake to the correct rule.
I don't know if it helps, but you can use a function to define the inputs of a rule. This way you can use the same rule to process either single-end or paired-end data, given that the output of the rule is the same...
def my_inputs(wildcards):
data_type = config["data_type"]
if (data_type == "pe"):
input = ...
elif (data_type == "se"):
input = ...
return input
rule my_rule:
input: my_inputs
...