I am a bit of a newbie with Snakemake and I am currently banging my head on how to process specific pairs of files with a Snakemake pipeline (apologies if the question is lame).
I have a set of fastq files from tumor and matched normal samples. Each tumor sample will have to be processed together with its matched normal one when performing variant calling etc. So, I have prepared a config file with samples listed as follows:
sample_list:
- sample: 1
tumor: AO1_04_RN_1_T_4_S4
control: AO2_07_C007558T1Wa_S37
- sample: 2
tumor: AO2_01_C007589T1FTa_S2
control: AO2_07_C007589T1Wa_S34
- sample: 3
tumor: AO9_09_FM_1_T_7_S13
control: AO2_07_C007558T1Wa_S37
The first issue I am encountering, however, is that these samples come with 4 fastq.gz files each, which I need to concatenate into 2 - for example, files for tumor sample AO1_04_RN_1_T_4_S4 are:
- AO1_04_RN_1_T_4_S4_L001_R1_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L002_R1_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L001_R2_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L002_R2_001.fastq.gz
and I would need a first rule (say rule concat_fastq
) to simply concatenate, for each sample, the R1 and the R2 files into two separate fastq AO1_04_RN_1_T_4_S4_R1.fastq.gz
and AO1_04_RN_1_T_4_S4_R2.fastq.gz
, as I will need to provide these as inputs for alignment.
At the moment, I am finding that specifying inputs like this:
r1 = expand("{path}/{sample}_L{num}_R1_001.fastq.gz",
path = config["input_path"],
num = ["001","002"],
sample = [sample["tumor"] for sample in config["sample_list"]] + [sample["control"] for sample in config["sample_list"]]),
r2 = expand("{path}/{sample}_L{num}_R2_001.fastq.gz",
path = config["input_path"],
num = ["001","002"],
sample = [sample["tumor"] for sample in config["sample_list"]] + [sample["control"] for sample in config["sample_list"]])
obviously concatenates together too many files each time, while something like this, without using expand
:
r1 = ["{path}/{sample}_L001_R1_001.fastq.gz", "{path}/{sample}_L002_R1_001.fastq.gz"],
r2 = ["{path}/{sample}_L001_R2_001.fastq.gz", "{path}/{sample}_L002_R2_001.fastq.gz"]
results in the following error:
Wildcards in input files cannot be determined from output files:
'path'
as I am unable to specify what {path} is.
Can someone please provide some advice on how to deal with paired files using Snakemake? I also hope my config file is structured correctly to then process tumor-normal pairs.
Thanks very much for your help
Using the rule all to setup the sample wildcards simplifies this a bit. Then in concat_fastq
we use expand again to get the two lanes for each read pair. I may also suggest using a CSV to specify samples and metadata, as a CSV is easier to parse (imo).
configfile: "config.yaml"
TUMORS = [d["tumor"] for d in config["sample_list"]]
CONTROLS = [d["control"] for d in config["sample_list"]]
SAMPLES = TUMORS + CONTROLS
READS_PATH = "data/reads"
rule all:
input:
reads=expand(
"results/concat_fastq/{sample}_R{num}.fastq.gz", sample=SAMPLES, num=[1, 2]
),
rule concat_fastq:
input:
# Use {{ }} to escape sample wildcard in expand.
r1=expand(READS_PATH + "{{sample}}_L00{lane}_R1.fastq.gz", lane=[1, 2]),
r2=expand(READS_PATH + "{{sample}}_L00{lane}_R2.fastq.gz", lane=[1, 2]),
output:
r1="results/concat_fastq/{sample}_R1.fastq.gz",
r2="results/concat_fastq/{sample}_R2.fastq.gz",
shell:
"""
cat {input.r1} > {output.r1}
cat {input.r2} > {output.r2}
"""