Snakemake: Handle paired reads in a directory

There is a directory with paired reads, the names and number of pairs are unknown, the possible file extensions are .fastq or .fastq.gz.

I need advice on how best to get paired reads in a directory for processing in the snakemake pipeline.

Solution

There are multiple ways to go about this. My preferred method is to use a CSV to specify read paths. An added benefit of this is that you can add metadata to each set of reads that you can then access in your workflow.

Here is an example for mapping reads for three samples: A, B, C

Directory structure:

.
├── Snakefile
├── data
│   ├── genome
│   │   └── ref.fa
│   └── reads
│       ├── a_R1.fastq.gz
│       ├── a_R2.fastq.gz
│       ├── b_R1.fastq
│       ├── b_R2.fastq
│       ├── c_R1.fq.gz
│       └── c_R2.fq.gz
└── sample_reads.csv

sample_reads.csv:

sample_id,read1,read2
a,data/reads/a_R1.fastq.gz,data/reads/a_R2.fastq.gz
b,data/reads/b_R1.fastq,data/reads/b_R2.fastq
c,data/reads/c_R1.fq.gz,data/reads/c_R2.fq.gz

Snakefile:

import pandas as pd

sample_reads = pd.read_csv("./sample_reads.csv")


rule all:
    """
    Get all sample_ids from dataframe and use in expand to make sam 
    files for each sample.
    """
    input:
        expand(
            "results/alignments/{sample}.sam",
            sample=sample_reads["sample_id"].tolist(),
        ),


rule align_reads:
    """
    Use sample wildcard to lookup read paths in pandas dataframe.
    Need to use input function to access sample wildcard value.
    """
    input:
        read_1=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
            "read1"
        ].item(),
        read_2=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
            "read2"
        ].item(),
        ref="data/genome/ref.fa",
    output:
        sam="results/alignments/{sample}.sam",
    shell:
        "bwa mem {input.ref} {input.read_1} {input.read_2} > {output.sam}"