I am using snakemake to design a RNAseq-data analysis pipeline. While I've managed to do that, I want to make my pipeline to be as adaptable as possible and make it able to deal with single-reads (SE) data or paired-end (PE) data within the same run of analyses, instead of analysing SE data in one run and PE data in another.
My pipeline is supposed to be designed like this :
Note : all rules of A have 1 input and 1 output, all rules of B have 2 inputs and 2 outputs and their respective commands look like :
somecommand -i {input} -o {output}
somecommand -i1 {input1} -i2 {input2} -o1 {output1} -o2 {output2}
Note 2 : except their differences in inputs/outputs, all rules of sets A and B have the same commands, parameters/etc...
In other words, I want my pipeline to be able to switch between the execution of set of rules A or set of rules B depending on the sample, either by giving it information on the sample in a config file at the start (sample 1 is SE, sample 2 is PE... this is known before-hand) or asking snakemake to counts the number of files after the dataset download to choose the proper next set of rules for each sample. If you see another way to do that, you're welcome to tell be about it.
I thought about using checkpoints, input functions and if/else statement, but I haven't managed to solve my problem with these.
Do you have any hints/advice/ways to make that "switch" happen ?
If you know the layout beforehand, then the easiest way would be to store it in some variable, something like this (or alternatively you read this from a config file into a dictionary):
layouts = {"sample1": "paired", "sample2": "single", ... etc}
What you can then do is "merge" your rule like this (I am guessing you are talking about trimming and alignment, so that's my example):
ruleorder: B > A
rule A:
input:
{sample}.fastq.gz
output:
trimmed_{sample}.fastq.gz
shell:
"somecommand -i {input} -o {output}"
rule B:
input:
input1={sample}_R1.fastq.gz,
input2={sample}_R2.fastq.gz
output:
output1=trimmed_{sample}_R1.fastq.gz,
output2=trimmed_{sample}_R2.fastq.gz
shell:
"somecommand -i1 {input.input1} -i2 {input.input2} -o1 {output.output1} -o2 {output.output2}"
def get_fastqs(wildcards):
output = dict()
if layouts[wildcards.sample] == "single":
output["input"] = "trimmed_sample2.fastq.gz"
elif layouts[wildcards.sample] == "paired":
output["input1"] = "trimmed_sample1_R1.fastq.gz"
output["input2"] = "trimmed_sample1_R2.fastq.gz"
return output
rule alignment:
def input:
unpack(get_fastqs)
def output:
somepath/{sample}.bam
shell:
...
There is a lot of stuff going on here.
Some self-promotion: I made a snakemake pipeline which does many things, including RNA-seq and downloading of samples online and automatically determining their layout (single-end vs paired-end). Please take a look and see if it solves your problem: https://vanheeringen-lab.github.io/seq2science/content/workflows/rna_seq.html
EDIT:
That was unclear wording of me. With merging I meant to "merge the single-end and paired-end and paired-end logic together, so you can continue with a single rule (e.g. count table, you name it).
Exactly! When a rule needs trimmed_sample1_R1.fastq.gz, how would Snakemake know the name of your sample? Is the name of the sample, sample1, or is it sample1_R1? It can be either, and that makes snakemake complain that it does not know how to resolve this. When you add a ruleorder you tell Snakemake, when it is unclear, resolve in this order.
Yes that's the way we solved it. We did it in that way since we want every rule to have it's own environment. If you do not use a seperate conda environment for alignment, then you can do it cleaner/prettier, like so
rule alignment:
input:
unpack(get_fastqs)
output:
somepath/{sample}.bam
run:
if layouts[wildcards.sample] == "single":
shell("single-end command")
if layouts[wildcards.sample] == "paired":
shell("paired-end command")
I feel like this option is much clearer than what we did in the seq2science pipeline. However in the seq2science pipeline we support many different aligners and they all have a different conda environment, so the run
directive can not be used.