I am trying to make a Snakemake pipeline using a configuration file config.yaml.
---
samples:
- Sample_1: reads/raw_reads_1
- Sample_2: reads/raw_reads_2
- Sample_3: reads/raw_reads_3
reference:
- "path/to/reference.fasta"
...
Following the good practices given in the Snakemake wrapper repository, I have used wrappers commands as much as possible. https://snakemake-wrappers.readthedocs.io/en/stable/ However I have observed that the input/outputs of the different wrappers do not match. Is there a way to manage the names of the files between the rules to connect the wrappers without modifying them?
import os
import snakemake.io
import glob
configfile: "config.yaml"
rule all:
input:
expand("mapped/{sample}.bam", sample=config["samples"])
rule fastqc:
input:
"reads/{sample}.fastq"
output:
html="qc/fastqc/{sample}.html",
zip="qc/fastqc/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
params:
extra = "--quiet"
log:
"logs/fastqc/{sample}.log"
threads: 1
resources:
mem_mb = 1024
wrapper:
"v2.6.0/bio/fastqc"
rule bwa_index:
input:
"{genome}.fasta",
output:
idx=multiext("{genome}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
log:
"logs/bwa_index/{genome}.log",
params:
algorithm="bwtsw",
wrapper:
"v2.6.0/bio/bwa/index"
rule bwa_mem:
input:
reads=["reads/{sample}.1.fastq", "reads/{sample}.2.fastq"],
idx=multiext("genome", ".amb", ".ann", ".bwt", ".pac", ".sa"),
output:
"mapped/{sample}.bam",
log:
"logs/bwa_mem/{sample}.log",
params:
extra=r"-R '@RG\tID:{sample}\tSM:{sample}'",
sorting="none", # Can be 'none', 'samtools' or 'picard'.
sort_order="queryname", # Can be 'queryname' or 'coordinate'.
sort_extra="", # Extra args for samtools/picard.
threads: 8
wrapper:
"v2.6.0/bio/bwa/mem"
Wrappers' purpose is only to wrap the logic how to do something. There is no magic way to automatically connect inputs to outputs for rules. It is your job to define input and output values for your purposes. Values in wrappers do not match, as they are set just for testing purposes.
That's the whole answer, however I will try to provide some example, but your config for samples is incorrect.
def get_sample_names():
# return list of sample names
rule all:
input:
bamqc=expand("mapped/{sample}.bam", sample=get_sample_names()),
fastqc=expand("qc/fastqc/{sample}_{pair}.html", sample=get_sample_names(), pair=["1","2"]),
rule bwa_index:
input:
"{prefix_reference}.fasta",
output:
idx=multiext("{prefix_reference}", ".amb", ".ann", ".bwt", ".pac", ".sa"),
log:
"{prefix_reference}/logs/bwa_index.log",
params:
algorithm="bwtsw",
wrapper:
"v2.6.0/bio/bwa/index"
rule bwa_mem:
input:
reads=[
"reads/{sample}.1.fastq.gz",
"reads/{sample}.2.fastq.gz",
],
idx=multiext(
os.path.splitext(config["reference"]),
".amb",
".ann",
".bwt",
".pac",
".sa",
),
output:
"mapped/{sample}.bam",
...
rule fastqc:
input:
"reads/{sample}.{pair}.fastq"
output:
html="qc/fastqc/{sample}_{pair}.html",
zip="qc/fastqc/{sample}_{pair}_fastqc.zip"
params:
extra = "--quiet"
log:
"logs/fastqc/{sample}_{pair}.log"
wrapper:
"v2.6.0/bio/fastqc"
In the example I presumed that all your data is in reads/
. Sample names can be either provided by the config or you can glob for all samples in reads
directory. The other, more complex and advanced way to manage your input reads is to use pepfiles.
I would advise further to check existing workflows, for example by checking newest workflows from the catalog