Search code examples
pythonsnakemakefastq

How to handle variable numbers of replicates in snakemake


I am processing a large number of fastq files. I have a number of folders, one for each sample, each containing multiple replicates (repl), eg a folder may contain:

Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R{r}_{repl}.fastq.gz
{r} = 1 or 2
{repl} = [1...n+1] is a three digit integer with leading zeros

each folder has a different number of files, e.g. n may be 13 for example in which case there would be the following files:

Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_001.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_002.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_003.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_004.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_005.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_006.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_007.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_008.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_009.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_010.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_011.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_012.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_013.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R1_014.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_001.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_002.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_003.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_004.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_005.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_006.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_007.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_008.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_009.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_010.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_011.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_012.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_013.fastq.gz
Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001_R2_014.fastq.gz

...

I need to pass to salmon in my shell command r1 and r2 where r1 is the list of files for the sample containing 'R1' in the filename, likewise for R2.

Since I don't know how may replicates there will be in each folder I thought to check if file exists and return the list from my own function repl_expand. It however produces the list of all files in all samples rather than being constrained per sample, as it is passed a list from expand(["{sample}"], sample=DATASETS).

It would seem like I could just pass "{sample"} to my function, but this is just a string... the substitution of {sample} with the actual sample name is apparently done by snakemake after the call to my function, but before the call if I use the snakemake expand function.

To be clear I want salmon to be run three times, each time with r1 being the list of 'R1' files for the sample, likewise for r2.

Therefore I'd like to be able to pass just the single iterated sample name to my function in each of three calls to produce r1 and r2 for my salmon commandline for that sample.

Here is my Snakefile with does produce three salmon runs but all are the same, containing the data from all samples.

import os

MAX_REPLICATES = 30
DATASETS = ["Sample_AJL-0704-ME/AJL-0704-ME_GCCAAT_L001",
 "Sample_BG-1058-ME/BG-1058-ME_ACAGTG_L002",
 "Sample_BO-0712-ME/BO-0712-ME_GTGAAA_L002"]
# and many more...

def repl_expand(templates, direction):
    ss = []
    for templ in templates:
        templ = templ + "_R{dir}_{repl:03d}.fastq.gz"
        s = []
        for i in range (1, MAX_REPLICATES):
            name = templ.format(dir=direction, repl=i)
            if os.path.exists(templ.format(dir=direction, repl=i)):
                s.append(name)
        ss.append(s)
    return ss

SALMON = "/path/to/salmon/salmon-1.4.0_linux_x86_64/salmon-latest_linux_x86_64/bin/salmon"

rule all:
    input: expand(["quants/{dataset}.sf"], dataset=DATASETS)


rule salmon_quant:
    input:
        r1 = repl_expand(expand(["{sample}"], sample=DATASETS), 1),
        r2 = repl_expand(expand(["{sample}"], sample=DATASETS), 2),
        index = "path/to/salmon/salmon_sa_index"
    output:
        "quants/{sample}.sf"
    params:
        dir = "quants",
        libtype = 'A',
        threads = 12
    shell:
        "{SALMON} quant \
         -i {input.index} \
         --libtype {params.libtype} \
         --threads {params.threads} \
         --validateMappings \
         --gcBias \
         -o {params.dir} \
         -1 {input.r1} -2 {input.r2}"

I'm new to snakemake so may be approaching this completely wrong and would be happy to be set on the right path...

thank you


Solution

  • First remark: you don't need to provide a list to the expand function. This function takes a template string and returns a list of strings as a product of the variables substitution. So the rule all should be:

    rule all:
        input: expand("quants/{dataset}.sf", dataset=DATASETS)
    

    Next remark is about the wildcards naming. However Snakemake doesn't care if you call the same entity with different names in different rules, I would recommend you to be consistent and explicit to avoid mistakes, so the output of the rule salmon_quant may sound like:

    rule salmon_quant:
        output:
            "quants/{dataset}.sf"
    

    Interestingly, you use the name "{sample}" in your rule salmon_quant in two different contexts, so they mean completely different things. The {sample} in the output section (the one that I renamed) is a wildcard, while the {sample} in the input is the variable for the expand function. That is why you observe that substitution "anomaly".

    If you want the wildcard to match the one from your output section, you should do:

    rule salmon_quant:
        input:
            r1 = lambda wildcards: repl_expand([wildcards.dataset], 1),
            r2 = lambda wildcards: repl_expand([wildcards.dataset], 2),
            index = "path/to/salmon/salmon_sa_index"
        output:
            "quants/{dataset}.sf"
    

    Actually you don't need that ugly list of one element if you simplify the implementation of repl_expand removing the unneeded loop for templ in templates:.