I am new in using snakemake, I have a simple problem when doing mapping in snakemake. I have couples of _1.fastq.gz & _2.fastq.gz and I would like to do pair-end mapping for around 10 pairs of fastq.gz. So I wrote a snakemake file for this.
import os
import snakemake.io
import glob
(SAMPLES,READS,) = glob_wildcards("raw/{sample}_{read}.fastq.gz")
READS=["1","2"]
REF="/data/data/reference/refs/ucsc.hg19.fasta.gz"
rule all:
input: expand("raw/{sample}.bam",sample=SAMPLES)
rule bwa_map:
input:
ref=REF,
r1=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS),
r2=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS)
output: "raw/{sample}.bam"
shell: "bwa mem -M -t 8 {input.ref} {input.r1} {input.r2} | samtools view -Sbh - > {output}"
Error:
RuleException:
CalledProcessError in line 17 of /data/data/Samples/snakemake-example/WGS-test/step2.smk:
Command ' set -euo pipefail; bwa mem -M -t 8 /data/data/reference/refs/ucsc.hg19.fasta.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz raw/sub1_1.fastq.gz raw/sub1_2.fastq.gz raw/sub2_1.fastq.gz raw/sub2_2.fastq.gz | samtools view -Sbh - > raw/sub2.bam ' returned non-zero exit status 1.
File "/data/data/Samples/snakemake-example/WGS-test/step2.smk", line 17, in __rule_bwa_map
File "/root/miniconda3/envs/bioinfo/lib/python3.6/concurrent/futures/thread.py", line 56, in run
My desired output would be like generate all 10 bam files like
sub1.bam sub2.bam sub3.bam ...
It looks like put all the fastq files into a command. how can i separate them and run pair by pair automatically without using hard code method. Please advice.
The first rule (here rule all
) specifies the files that you would like to create during your snakemake workflow.
For a given file, f
, in rule all::input
, snakemake will look through all the rules and try to find one that can create f
(based on pattern matching on the output
segment of each rule).
Suppose f = raw/my_sample.bam
Once snakemake
has found a rule that can create f
, it will determine all the input files required to make that file.
So here, snakemake finds that f = raw/my_sample.bam
can be created by rule bwa_map
(since f
matches the pattern raw/<anything>.bam
) and then determines what files are required to make f
based on the input
segment.
Snakemake thinks: I can make raw/my_sample.bam
if I have
the file ref="/data/data/reference/refs/ucsc.hg19.fasta.gz"
the files r1=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS)
and the files r2=expand("raw/{sample}_{read}.fastq.gz",sample=SAMPLES,read=READS)
In the expand
, r1
expands sample
to each value in SAMPLES and read
to each value in READS. But you have 10 values in SAMPLES and 2 values in READS, so r1
expands to 20 different file paths for each output file that it is attempting to make. It is ignoring the sample
wildcard that is present in the output
clause (because you've overwritten it in the expand
call).
You have to let the wildcards that are defined in the output clause bubble up to the input clause:
import os
import snakemake.io
import glob
(SAMPLES,READS,) = glob_wildcards("raw/{sample}_{read}.fastq.gz")
READS=["1","2"]
REF="/data/data/reference/refs/ucsc.hg19.fasta.gz"
rule all:
input: expand("raw/{sample}.bam",sample=SAMPLES)
rule bwa_map:
input:
ref=REF,
# determine `r1` based on the {sample} wildcard defined in `output`
# and the fixed value `1` to indicate the read direction
r1="raw/{sample}_1.fastq.gz",
# determine `r2` based on the {sample} wildcard similarly
r2="raw/{sample}_2.fastq.gz"
output: "raw/{sample}.bam"
# better to pass in the threads than to hardcode them in the shell command
threads: 8
shell: "bwa mem -M -t {threads} {input.ref} {input.r1} {input.r2} | samtools view -Sbh - > {output}"
I'd urge you to have a look at how the rule for bwa
alignment is written in the snakemake wrappers resource (you might consider using the wrapper as well): https://snakemake-wrappers.readthedocs.io/en/stable/wrappers/bwa/mem.html
Off-topic: From a code review perspective, I question why you are outputting aligned data to your raw
directory? Would it make more sense to output your aligned data to align
or aligned
instead? You also appear to be importing from packages that you don't use.