python bioinformatics snakemake snakeyaml

use snakemake pair-end bwa alignment

I am new in using snakemake, I have a simple problem when doing mapping in snakemake. I have couples of _1.fastq.gz & _2.fastq.gz and I would like to do pair-end mapping for around 10 pairs of fastq.gz. So I wrote a snakemake file for this.

import os
import snakemake.io
import glob

(SAMPLES,READS,) = glob_wildcards("raw/{sample}_{read}.fastq.gz")
READS=["1","2"]
REF="/data/data/reference/refs/ucsc.hg19.fasta.gz"

rule all:
    input: expand("raw/{sample}.bam",sample=SAMPLES)

rule bwa_map:
    input:
        ref=REF,
        r1=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS),
        r2=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS)

    output: "raw/{sample}.bam"

    shell: "bwa mem -M -t 8 {input.ref} {input.r1} {input.r2} | samtools view -Sbh - > {output}"

Error:

RuleException:
CalledProcessError in line 17 of /data/data/Samples/snakemake-example/WGS-test/step2.smk:
Command ' set -euo pipefail;  bwa mem -M -t 8 /data/data/reference/refs/ucsc.hg19.fasta.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz | samtools view -Sbh - > raw/sub2.bam ' returned non-zero exit status 1.
  File "/data/data/Samples/snakemake-example/WGS-test/step2.smk", line 17, in __rule_bwa_map
  File "/root/miniconda3/envs/bioinfo/lib/python3.6/concurrent/futures/thread.py", line 56, in run

My desired output would be like generate all 10 bam files like

sub1.bam sub2.bam sub3.bam ...

It looks like put all the fastq files into a command. how can i separate them and run pair by pair automatically without using hard code method. Please advice.

Solution

The first rule (here rule all) specifies the files that you would like to create during your snakemake workflow.

For a given file, f, in rule all::input, snakemake will look through all the rules and try to find one that can create f (based on pattern matching on the output segment of each rule).

Suppose f = raw/my_sample.bam

Once snakemake has found a rule that can create f, it will determine all the input files required to make that file.

So here, snakemake finds that f = raw/my_sample.bam can be created by rule bwa_map (since f matches the pattern raw/<anything>.bam) and then determines what files are required to make f based on the input segment.

Snakemake thinks: I can make raw/my_sample.bam if I have the file ref="/data/data/reference/refs/ucsc.hg19.fasta.gz" the files r1=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS) and the files r2=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS)

In the expand, r1 expands sample to each value in SAMPLES and read to each value in READS. But you have 10 values in SAMPLES and 2 values in READS, so r1 expands to 20 different file paths for each output file that it is attempting to make. It is ignoring the sample wildcard that is present in the output clause (because you've overwritten it in the expand call).

You have to let the wildcards that are defined in the output clause bubble up to the input clause:

import os
import snakemake.io
import glob

(SAMPLES,READS,) = glob_wildcards("raw/{sample}_{read}.fastq.gz")
READS=["1","2"]
REF="/data/data/reference/refs/ucsc.hg19.fasta.gz"

rule all:
    input: expand("raw/{sample}.bam",sample=SAMPLES)

rule bwa_map:
    input:
        ref=REF,
        # determine `r1` based on the {sample} wildcard defined in `output`
        # and the fixed value `1` to indicate the read direction
        r1="raw/{sample}_1.fastq.gz",
        # determine `r2` based on the {sample} wildcard similarly
        r2="raw/{sample}_2.fastq.gz"

    output: "raw/{sample}.bam"

    # better to pass in the threads than to hardcode them in the shell command
    threads: 8

    shell: "bwa mem -M -t {threads} {input.ref} {input.r1} {input.r2} | samtools view -Sbh - > {output}"

I'd urge you to have a look at how the rule for bwa alignment is written in the snakemake wrappers resource (you might consider using the wrapper as well): https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bwa/mem.html

Off-topic: From a code review perspective, I question why you are outputting aligned data to your raw directory? Would it make more sense to output your aligned data to align or aligned instead? You also appear to be importing from packages that you don't use.