snakemake - replacing wildcards in input directive by anonymous function

I am writing a snakemake that will run a bioinformatics pipeline for several input samples. These input files (two for each analysis, one with the partial string match R1 and the second with the partial string match R2) start with a pattern and end with the extension .fastq.gz. Eventually I want to perform multiple operations, though, for this example I just want to align the fastq reads against a reference genome using bwa mem. So for this example my input file is NIPT-N2002394-LL_S19_R1_001.fastq.gz and I want to generate NIPT-N2002394-LL.bam (see code below specifying the directories where input and output are).

My config.yaml file looks like so:

# Run_ID
run: "200311_A00154_0454_AHHHKMDRXX"

# Base directory: the analysis directory from which I will fetch the samples
bd: "/nexusb/nipt/"


# Define the prefix
# will be used to subset the folders in bd
prefix: "NIPT"

# Reference:
ref: "/nexus/bhinckel/19/ONT_projects/PGD_breakpoint/ref_hg19_local/hg19_chr1-y.fasta"

And below is my snakefile

import os
import re
#############
# config file
#############
configfile: "config.yaml"


#######################################
# Parsing variables from config.yaml
#######################################
RUN = config['run']

BD = config['bd']

PREFIX = config['prefix']

FQDIR = f'/nexusb/Novaseq/{RUN}/Unaligned/'

BASEDIR = BD + RUN
SAMPLES = [sample for sample in os.listdir(BASEDIR) if sample.startswith(PREFIX)]
# explanation: in BASEDIR I have multiple subdirectories. The names of the subdirectories starting with PREFIX will be the name of the elements I want to have in the list SAMPLES, which eventually shall be my {sample} wildcard

#############
# RULES
#############
rule all:
    input:
        expand("aligned/{sample}.bam", sample = SAMPLES)


rule bwa_map:
    input:
        REF = config['ref'],
        R1 = FQDIR + "{sample}_S{s}_R1_001.fastq.gz",
        R2 = FQDIR + "{sample}_S{s}_R2_001.fastq.gz"
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2}| samtools view -Sb - > {output}"

But I am getting:

Building DAG of jobs...
WildcardError in line 55 of /nexusb/nipt/200311_A00154_0454_AHHHKMDRXX/testMetrics/snakemake/Snakefile:
Wildcards in input files cannot be determined from output files:
's'

When calling snakemake -np

I believe my error lies in the definitions of R1 and R2 in the input directive. I find it puzzling because according to the official documentation snakemake should interpret any wildcard as the regex .+. But it is not doing that for sample NIPT-PearlPPlasma-05-PPx, whose R1 and R2 should be NIPT-PearlPPlasma-05-PPx_S5_R1_001.fastq.gz and NIPT-PearlPPlasma-05-PPx_S5_R2_001.fastq.gz, respectively.

Solution

Take a look again at the snakemake tutorial on how input is inferred from output, anyways I think the problem lies in this piece of code:

output:
    expand("aligned/{sample}.bam", sample = SAMPLES)

And needs to be changed into

output:
    "aligned/{sample}.bam"

What you had didn't work because before expand("aligned/{sample}.bam", sample = SAMPLES) basically becomes a list like this ["aligned/sample0.bam","aligned/sample1.bam"]. When you remove the expand, you only give a "description" of how the output should look like, and thus snakemake can infer the wildcards and input.

edit:

It's difficult to test it since I don't have the actual files, but you should do something like this. Won't work if multiple S-thingies exist.

def get_reads(wildcards):
    R1 = FQDIR + f"{wildcards.sample}_S{{s}}_R1_001.fastq.gz"
    R2 = FQDIR + f"{wildcards.sample}_S{{s}}_R2_001.fastq.gz"
    globbed = glob_wildcards(R1)
    R1, R2 = expand([R1, R2], s=globbed.s)
    return {"R1": R1, "R2": R2}


rule bwa_map:
    input:
        unpack(get_reads),
        REF = config['ref']
    output:
        "aligned/{sample}.bam"
    shell:
        "bwa mem {input.REF} {input.R1} {input.R2}| samtools view -Sb - > {output}"