I've implemented a working Snakemake workflow and now want to add the ability to merge specific samples, whether they are technical or biological replicates. I was told I should use multiple wildcards and work with a proper data frame for this, but I'm unsure what the syntax would even look like and how to differentiate between entries properly.
My samples.tsv used to look like this:
| sample | fq1 | fq2 |
|---|---|---|
| bCalAnn1_1_1 | path/to/file | path/to/file |
| bCalAnn1_1_2 | path/to/file | path/to/file |
| bCalAnn2_1_1 | path/to/file | path/to/file |
| bCalAnn2_1_2 | path/to/file | path/to/file |
| bCalAnn2_2_1 | path/to/file | path/to/file |
| bCalAnn2_3_1 | path/to/file | path/to/file |
Now it looks like this:
| sample | unit_bio | unit_tech | fq1 | fq2 |
|---|---|---|---|---|
| bCalAnn1 | 1 | 1 | path/to/file | path/to/file |
| bCalAnn1 | 1 | 2 | path/to/file | path/to/file |
| bCalAnn2 | 1 | 1 | path/to/file | path/to/file |
| bCalAnn2 | 1 | 2 | path/to/file | path/to/file |
| bCalAnn2 | 2 | 1 | path/to/file | path/to/file |
| bCalAnn2 | 3 | 1 | path/to/file | path/to/file |
Here, unit_bio is the index of the library for a sample and unit_tech is the index of the lane within that library.
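As a point of reference, this is a minimal sketch of reading and querying such a sheet with plain pandas (the filename and values are placeholders; `dtype=str` keeps the unit columns comparable to Snakemake wildcard values, which are always strings):

```python
import pandas as pd

# Read the sheet without an index; dtype=str keeps unit_bio/unit_tech
# as strings so they compare cleanly against wildcard values later.
samples = pd.read_table("samples.tsv", dtype=str)

# Pick out the single row for one (sample, unit_bio, unit_tech) combination.
row = samples[
    (samples["sample"] == "bCalAnn1")
    & (samples.unit_bio == "1")
    & (samples.unit_tech == "2")
]
print(row.fq1.iloc[0])  # -> path/to/file
```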
My Snakefile looks like this:
```python
import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], index_col="sample")

rule all:
    input:
        expand(config["arima_mapping"] + "{sample}_{unit_bio}_{unit_tech}_R1.bam",
               sample=samples.sample, unit_bio=samples.unit_bio, unit_tech=samples.unit_tech)

rule r1_mapping:
    input:
        read1 = lambda wc:
            samples[(samples.sample == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        ref = config["PacBio_assembly"],
        linker = config["index"] + "pac_bio_assembly.fna.amb"
    output:
        config["arima_mapping"] + "unprocessed_bam/({sample},{unit_bio},{unit_tech})_R1.bam"
    params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mapping/({sample},{unit_bio},{unit_tech})_R1.log"
    threads:
        12
    shell:
        "bwa mem -t {threads} {input.ref} {input.read1} | samtools view --threads {threads} -h -b -o {output} 2> {log}"
```
`r1_mapping` is basically my first rule, which requires no differentiating between replicates, so I want to run it for every row. I experimented a bit with the syntax but ultimately ran into this error:
```
MissingInputException in rule r1_mapping in line 20 of /home/path/to/assembly_downstream/workflow/rules/arima_mapping.smk:
Missing input files for rule r1_mapping:
    output: /home/path/to/assembly_downstream/intermed/arima_mapping/unprocessed_bam/('bCalAnn1', 1, 1)_R1.bam
    wildcards: sample='bCalAnn1', unit_bio= 1, unit_tech= 1
    affected files:
        /home/path/to/assembly_downstream/data/arima_HiC/('bCalAnn1', 1, 1)_R1.fastq.gz
```
Am I even reading the table properly? I'm quite confused at this point; can anyone give me a hint on where to start fixing this?
Fixed by changing

```python
samples = pd.read_table(config["samples"], index_col=["sample","unit_bio","unit_tech"])
```

to

```python
samples = pd.read_table(config["samples"], index_col="sample")
```
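For context on why the multi-index produced those odd paths: with `index_col=["sample","unit_bio","unit_tech"]`, each index entry is a tuple, and formatting a tuple into a filename yields exactly the `('bCalAnn1', 1, 1)` fragment from the error. A minimal, self-contained demonstration with made-up values:

```python
import pandas as pd

# Hypothetical one-row sheet, indexed by three columns.
df = pd.DataFrame(
    {"sample": ["bCalAnn1"], "unit_bio": [1], "unit_tech": [1], "fq1": ["path/to/file"]}
).set_index(["sample", "unit_bio", "unit_tech"])

# Each multi-index entry is a tuple, so string-formatting it
# reproduces the "('bCalAnn1', 1, 1)" seen in the error paths.
print(f"{df.index[0]}_R1.bam")  # -> ('bCalAnn1', 1, 1)_R1.bam
```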
EDIT: New error
```
Missing input files for rule all:
    affected files:
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_2_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_1_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ...
```
This goes on for a total of six times, which sounds like all rows are currently being read six times each. I guess this has something to do with how `expand()` works? I'm going to keep researching for now.
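For what it's worth, two separate things appear to be going on in that log. First, because the table was read with `index_col="sample"`, there is no longer a `sample` column, so `samples.sample` resolves to pandas' built-in `DataFrame.sample` method; its repr is the `<bound method NDFrame.sample of ...>` text leaking into the paths. Second, `expand()` by default builds the cross product of all supplied value lists rather than pairing them row-wise; passing `zip` as the second argument pairs them instead. A small illustration (Snakemake's `expand` can be imported outside a Snakefile as shown):

```python
from snakemake.io import expand

# Default behaviour: cross product of the value lists.
print(expand("{s}_{u}.bam", s=["a", "b"], u=[1, 2]))
# ['a_1.bam', 'a_2.bam', 'b_1.bam', 'b_2.bam']

# With zip, values are paired row-wise: one output per sheet row.
print(expand("{s}_{u}.bam", zip, s=["a", "b"], u=[1, 2]))
# ['a_1.bam', 'b_2.bam']
```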
This is what I would do, slightly simplified to keep things shorter - not tested, there may be errors!
```python
import re
import pandas as pd

samples = pd.read_table('samples.tsv', dtype=str)

wildcard_constraints:
    # Constraints maybe not needed but in my opinion better to set them
    sample='|'.join([re.escape(x) for x in samples['sample']]),
    unit_bio='|'.join([re.escape(x) for x in samples.unit_bio]),
    unit_tech='|'.join([re.escape(x) for x in samples.unit_tech]),

rule all:
    input:
        # Assume for now you just want to produce these bam files,
        # one per row of the sample sheet:
        expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
               sample=samples['sample'], unit_bio=samples.unit_bio, unit_tech=samples.unit_tech),

rule r1_mapping:
    input:
        read1 = lambda wc:
            samples[(samples['sample'] == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        # Other inputs/params that don't need wildcards...
    output:
        "scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam",
```
The main point is to use an input function (a lambda here) to query the sample sheet and get the fastq file corresponding to a given sample, unit_bio, and unit_tech.
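For the merging step itself (your original goal), the same input-function idea extends naturally: a function can return the list of all bams belonging to one sample. This is only a hedged sketch, not part of the code above; the rule name, the merged/ directory, and the use of samtools merge are my assumptions:

```python
def sample_bams(wc):
    # All rows of the sheet belonging to this sample:
    # one mapped bam per (unit_bio, unit_tech) combination.
    rows = samples[samples['sample'] == wc.sample]
    return expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
                  sample=rows['sample'], unit_bio=rows.unit_bio, unit_tech=rows.unit_tech)

rule merge_sample:  # hypothetical rule name
    input:
        sample_bams
    output:
        "merged/{sample}_R1.bam"
    shell:
        "samtools merge {output} {input}"
```

Requesting merged/bCalAnn2_R1.bam would then pull in and merge the four bCalAnn2 library/lane bams in one step.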
If this doesn't work or you can't make sense of it, post a fully reproducible example that we can easily replicate.