python python-3.x bioinformatics snakemake

Snakemake - dynamically derive the targets from input files

I have a large number of input files organized like this:

data/
├── set1/
│   ├── file1_R1.fq.gz
│   ├── file1_R2.fq.gz
│   ├── file2_R1.fq.gz
│   ├── file2_R2.fq.gz
|   :
│   └── fileX_R2.fq.gz
├── another_set/
│   ├── asdf1_R1.fq.gz
│   ├── asdf1_R2.fq.gz
│   ├── asdf2_R1.fq.gz
│   ├── asdf2_R2.fq.gz
|   :
│   └── asdfX_R2.fq.gz
:   
└── many_more_sets/
    ├── zxcv1_R1.fq.gz
    ├── zxcv1_R2.fq.gz
    :
    └── zxcvX_R2.fq.gz

If you are familiar with bioinformatics - these are of course fastq files from paired end sequencing runs. I am trying to generate a snakemake workflow that reads all of those and I'm already failing at the first rule. This is my attempt:

configfile: "config.yaml"

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])

rule clip_and_trim_reads:
    input:
        read1=expand("{data}/{set}/{{sample}}_R1.fq.gz", data=config["raw_data"], set=config["sets"]),
        read2=expand("{data}/{set}/{{sample}}_R2.fq.gz", data=config["raw_data"], set=config["sets"])
    output:
        read1=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R1.fq.gz", output=config["output"]),
        read2=expand("{output}/clipped_and_trimmed_reads/{{sample}}_R2.fq.gz", output=config["output"])
    threads: 16
    shell:
        """
        someTool -o {output.read1} -p {output.read2} \
        {input.read1} {input.read2}
        """

I cannot specify clip_and_trim_reads as a target, because Target rules may not contain wildcards. I tried adding the all rule, but this gives me another error:

$ snakemake -np
Building DAG of jobs...
WildcardError in line 3 of /work/project/Snakefile:
Wildcards in input files cannot be determined from output files:
'sample'

I also tried using the dynamic() function for the all rule, which weirdly did find the files, but still gave me this error:

$ snakemake -np
Dynamic output is deprecated in favor of checkpoints (see docs). It will be removed in Snakemake 6.0.
Building DAG of jobs...
MissingInputException in line 7 of /work/project/ladsie_002/analyses/scripts/2019-08-02_splice_leader_HiC/Snakefile:
Missing input files for rule clip_and_trim_reads:
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R2.fq.gz
data/raw_data/set1/__snakemake_dynamic___R1.fq.gz
[...]

I have over a hundred different files, so I would very much like to avoid creating a list with all filenames. Any ideas how to achieve that?

Solution

I think you misunderstand how snakemake works. When you run snakemake you define which output you want, either on the command-line, or else the input of the first rule in the Snakefile (your rule all) will be generated. Since you do not specify any output (snakemake -np), Snakemake will try to generate the input of rule all.

The input of your rule all basically is:

"somepath/clipped_and_trimmed_reads/{sample}_R1.fq.gz"

Snakemake, unfortunately, does not know how to generate output from this... We need to tell Snakemake which files to use. We can do this manually:

rule all:
    input:
        "somepath/clipped_and_trimmed_reads/file1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/asdf1_R1.fq.gz",
        "somepath/clipped_and_trimmed_reads/zxcv1_R1.fq.gz"

But this becomes quite cumbersome as we get more files, and as you specify in the question, is not what you want. We need to write a small function that that gets all the filenames for us.

import glob
import re

data=config["raw_data"]
samples = []
locations = {}
for file in glob.glob(data + "/**", recursive=True):
    if '_R1.fq.gz' in file:
        split = re.split('/|_R1', file)
        filename, directory = split[-2], split[-3]
        locations[filename] = directory  # we will need this one later
        samples.append(filename)

We can now feed this to our rule all:

rule all:
    input:
        read1=expand("{output}/clipped_and_trimmed_reads/{sample}_R1.fq.gz", output=config["output"], sample=samples),
        read2=expand("{output}/clipped_and_trimmed_reads/{sample}_R2.fq.gz", output=config["output"], sample=samples)

Note that we do not have sample as a wildcard anymore, but we 'expand' it into our read1 and read2, making all possible combinations of output and sample.

We are only halfway done however.. If we call Snakemake like this, it will know exactly which output we want, and which rule can generate this (rule clip_and_trim_reads). However it still does not know where to look for these files. Fortunately we already have a dictionary where we stored these (stored in locations).

rule clip_and_trim_reads:
    input:
        read1=lambda wildcards: expand("{data}/{set}/{sample}_R1.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample),
        read2=lambda wildcards: expand("{data}/{set}/{sample}_R2.fq.gz", data=config["raw_data"], set=locations[wildcards.sample], sample=wildcards.sample)
    output:
        ...

Now everything should work!! And even better; since all our results from the rule clip_and_trim_reads are written to a single folder, continuing from here should be much easier!

p.s. I haven't tested any of the code, so probably not everything works on the first try. However the message remains.