Search code examples
bioinformaticssnakemake

Snakemake: Handle paired reads in a directory


There is a directory with paired reads, the names and number of pairs are unknown, the possible file extensions are .fastq or .fastq.gz.

I need advice on how best to get paired reads in a directory for processing in the snakemake pipeline.


Solution

  • There are multiple ways to go about this. My preferred method is to use a CSV to specify read paths. An added benefit of this is that you can add metadata to each set of reads that you can then access in your workflow.

    Here is an example for mapping reads for three samples: A, B, C

    Directory structure:

    .
    ├── Snakefile
    ├── data
    │   ├── genome
    │   │   └── ref.fa
    │   └── reads
    │       ├── a_R1.fastq.gz
    │       ├── a_R2.fastq.gz
    │       ├── b_R1.fastq
    │       ├── b_R2.fastq
    │       ├── c_R1.fq.gz
    │       └── c_R2.fq.gz
    └── sample_reads.csv
    

    sample_reads.csv:

    sample_id,read1,read2
    a,data/reads/a_R1.fastq.gz,data/reads/a_R2.fastq.gz
    b,data/reads/b_R1.fastq,data/reads/b_R2.fastq
    c,data/reads/c_R1.fq.gz,data/reads/c_R2.fq.gz
    

    Snakefile:

    import pandas as pd
    
    sample_reads = pd.read_csv("./sample_reads.csv")
    
    
    rule all:
        """
        Get all sample_ids from dataframe and use in expand to make sam 
        files for each sample.
        """
        input:
            expand(
                "results/alignments/{sample}.sam",
                sample=sample_reads["sample_id"].tolist(),
            ),
    
    
    rule align_reads:
        """
        Use sample wildcard to lookup read paths in pandas dataframe.
        Need to use input function to access sample wildcard value.
        """
        input:
            read_1=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
                "read1"
            ].item(),
            read_2=lambda wc: sample_reads[sample_reads["sample_id"] == wc.sample][
                "read2"
            ].item(),
            ref="data/genome/ref.fa",
        output:
            sam="results/alignments/{sample}.sam",
        shell:
            "bwa mem {input.ref} {input.read_1} {input.read_2} > {output.sam}"