Search code examples
snakemake

How to process spacific pairs of files with Snakemake?


I am a bit of a newbie with Snakemake and I am currently banging my head on how to process specific pairs of files with a Snakemake pipeline (apologies if the question is lame).
I have a set of fastq files from tumor and matched normal samples. Each tumor sample will have to be processed together with its matched normal one when performing variant calling etc. So, I have prepared a config file with samples listed as follows:

sample_list:
  - sample: 1
    tumor: AO1_04_RN_1_T_4_S4
    control: AO2_07_C007558T1Wa_S37
  - sample: 2
    tumor: AO2_01_C007589T1FTa_S2
    control: AO2_07_C007589T1Wa_S34
  - sample: 3
    tumor: AO9_09_FM_1_T_7_S13 
    control: AO2_07_C007558T1Wa_S37

The first issue I am encountering, however, is that these samples come with 4 fastq.gz files each, which I need to concatenate into 2 - for example, files for tumor sample AO1_04_RN_1_T_4_S4 are:

- AO1_04_RN_1_T_4_S4_L001_R1_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L002_R1_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L001_R2_001.fastq.gz
- AO1_04_RN_1_T_4_S4_L002_R2_001.fastq.gz

and I would need a first rule (say rule concat_fastq) to simply concatenate, for each sample, the R1 and the R2 files into two separate fastq AO1_04_RN_1_T_4_S4_R1.fastq.gz and AO1_04_RN_1_T_4_S4_R2.fastq.gz, as I will need to provide these as inputs for alignment.

At the moment, I am finding that specifying inputs like this:

r1 = expand("{path}/{sample}_L{num}_R1_001.fastq.gz",
        path = config["input_path"],
        num = ["001","002"],
    sample = [sample["tumor"] for sample in config["sample_list"]] + [sample["control"] for sample in config["sample_list"]]),
r2 = expand("{path}/{sample}_L{num}_R2_001.fastq.gz",
        path = config["input_path"],
        num = ["001","002"],
    sample = [sample["tumor"] for sample in config["sample_list"]] + [sample["control"] for sample in config["sample_list"]])

obviously concatenates together too many files each time, while something like this, without using expand:

r1 = ["{path}/{sample}_L001_R1_001.fastq.gz", "{path}/{sample}_L002_R1_001.fastq.gz"],
r2 = ["{path}/{sample}_L001_R2_001.fastq.gz", "{path}/{sample}_L002_R2_001.fastq.gz"]

results in the following error:

Wildcards in input files cannot be determined from output files:
'path'

as I am unable to specify what {path} is.

Can someone please provide some advice on how to deal with paired files using Snakemake? I also hope my config file is structured correctly to then process tumor-normal pairs.

Thanks very much for your help


Solution

  • Using the rule all to setup the sample wildcards simplifies this a bit. Then in concat_fastq we use expand again to get the two lanes for each read pair. I may also suggest using a CSV to specify samples and metadata, as a CSV is easier to parse (imo).

    configfile: "config.yaml"
    
    
    TUMORS = [d["tumor"] for d in config["sample_list"]]
    CONTROLS = [d["control"] for d in config["sample_list"]]
    SAMPLES = TUMORS + CONTROLS
    READS_PATH = "data/reads"
    
    
    rule all:
        input:
            reads=expand(
                "results/concat_fastq/{sample}_R{num}.fastq.gz", sample=SAMPLES, num=[1, 2]
            ),
    
    
    rule concat_fastq:
        input:
            # Use {{ }} to escape sample wildcard in expand.
            r1=expand(READS_PATH + "{{sample}}_L00{lane}_R1.fastq.gz", lane=[1, 2]),
            r2=expand(READS_PATH + "{{sample}}_L00{lane}_R2.fastq.gz", lane=[1, 2]),
        output:
            r1="results/concat_fastq/{sample}_R1.fastq.gz",
            r2="results/concat_fastq/{sample}_R2.fastq.gz",
        shell:
            """
            cat {input.r1} > {output.r1}
            cat {input.r2} > {output.r2}
            """