
Using multiple wildcards in Snakemake to differentiate replicates in a TSV file


I've successfully implemented a Snakemake workflow and now want to add the ability to merge specific samples, whether they are technical or biological replicates. I was told I should use multiple wildcards and work with a proper data frame for this. However, I'm pretty unsure what the syntax would even look like and how I can differentiate between entries properly.

My samples.tsv used to look like this:

sample fq1 fq2
bCalAnn1_1_1 path/to/file path/to/file
bCalAnn1_1_2 path/to/file path/to/file
bCalAnn2_1_1 path/to/file path/to/file
bCalAnn2_1_2 path/to/file path/to/file
bCalAnn2_2_1 path/to/file path/to/file
bCalAnn2_3_1 path/to/file path/to/file

Now it looks like this:

sample unit_bio unit_tech fq1 fq2
bCalAnn1 1 1 path/to/file path/to/file
bCalAnn1 1 2 path/to/file path/to/file
bCalAnn2 1 1 path/to/file path/to/file
bCalAnn2 1 2 path/to/file path/to/file
bCalAnn2 2 1 path/to/file path/to/file
bCalAnn2 3 1 path/to/file path/to/file

Where unit_bio is the index of the library for a given sample and unit_tech is the index of the lane for a given library.
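For reference, a quick sketch of how the new sheet can be queried with pandas (assuming the file is named samples.tsv and the unit columns are read as strings):

import pandas as pd

# dtype=str keeps unit_bio/unit_tech as strings, so they compare cleanly
# against wildcard values (which are always strings) later on.
samples = pd.read_table("samples.tsv", dtype=str)

# e.g. all lanes (unit_tech) belonging to library 1 of sample bCalAnn2:
lanes = samples[(samples["sample"] == "bCalAnn2") & (samples["unit_bio"] == "1")]
print(lanes[["unit_tech", "fq1", "fq2"]])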

My Snakefile looks like this:

import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], index_col="sample")


rule all:
    input:
        expand(config["arima_mapping"] + "{sample}_{unit_bio}_{unit_tech}_R1.bam", 
               sample=samples.sample, unit_bio=samples.unit_bio, unit_tech=samples.unit_tech)



rule r1_mapping:
    input:
        read1 = lambda wc: 
                samples[(samples.sample == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        ref = config["PacBio_assembly"],
        linker = config["index"] + "pac_bio_assembly.fna.amb"
    output:
        config["arima_mapping"] + "unprocessed_bam/({sample},{unit_bio},{unit_tech})_R1.bam"
    params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mapping/({sample},{unit_bio},{unit_tech})_R1.log"
    threads:
        12
    shell:
        "bwa mem -t {threads} {input.ref} {input.read1} | samtools view --threads {threads} -h -b -o {output} 2> {log}"

r1_mapping is basically my first rule, which doesn't require differentiating between replicates, so I want to run it for every row. I experimented a little with the syntax but ultimately ran into this error:

MissingInputException in rule r1_mapping in line 20 of /home/path/to/assembly_downstream/workflow/rules/arima_mapping.smk:
Missing input files for rule r1_mapping:
    output: /home/path/to/assembly_downstream/intermed/arima_mapping/unprocessed_bam/('bCalAnn1', 1, 1)_R1.bam
    wildcards: sample='bCalAnn1', unit_bio= 1, unit_tech= 1
    affected files:
        /home/path/to/assembly_downstream/data/arima_HiC/('bCalAnn1', 1, 1)_R1.fastq.gz

Am I even reading the table properly? I'm currently very confused; can anyone give me a hint on where to start fixing this?

EDIT: Fixed by changing:

samples = pd.read_table(config["samples"], index_col=["sample","unit_bio","unit_tech"])

to

samples = pd.read_table(config["samples"], index_col="sample")
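For reference, as far as I understand the difference between the two calls: with the list form, all three columns are moved into a MultiIndex and stop being regular columns, while with a single column only sample becomes the index. A quick sketch:

import pandas as pd

# index_col as a list -> MultiIndex; unit_bio/unit_tech are no longer
# columns, so filters like samples.unit_bio == ... raise AttributeError.
# Lookups would go through .loc with a tuple instead:
multi = pd.read_table("samples.tsv", index_col=["sample", "unit_bio", "unit_tech"])
row = multi.loc[("bCalAnn2", 1, 1)]

# index_col="sample" -> only sample is the index; unit_bio and unit_tech
# stay as ordinary columns and can still be filtered on:
flat = pd.read_table("samples.tsv", index_col="sample")
subset = flat[flat.unit_bio == 1]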

EDIT: New error

Missing input files for rule all:
    affected files:
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of           unit_bio  ...                                                fq2
sample              ...                                                   
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         2  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         3  ...  /home/path/to/assembly_downstream/data/ari...

[6 rows x 4 columns]>_3_2_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of           unit_bio  ...                                                fq2
sample              ...                                                   
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn1         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         1  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         2  ...  /home/path/to/assembly_downstream/data/ari...
bCalAnn2         3  ...  /home/path/to/assembly_downstream/data/ari...

[6 rows x 4 columns]>_3_1_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of           unit_bio  ...  

This goes on for a total of 6 times, which sounds like I'm currently getting all rows read 6 times each. I guess this has something to do with how expand() works..? I'm gonna keep researching for now.
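As a side note, the <bound method NDFrame.sample of ...> in those paths comes from a name clash: sample is also a DataFrame method, so attribute access returns the bound method instead of the data. A minimal illustration:

import pandas as pd

samples = pd.read_table("samples.tsv", index_col="sample")

print(samples.sample)   # <bound method NDFrame.sample of ...> - the method wins
print(samples.index)    # the actual sample names (here 'sample' is the index)
# If the sheet were read without index_col, bracket access avoids the clash:
# samples = pd.read_table("samples.tsv", dtype=str)
# print(samples["sample"])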


Solution

  • This is what I would do, slightly simplified to keep things shorter - not tested, there may be errors!

    import pandas as pd
    import re
    
    samples = pd.read_table('samples.tsv', dtype=str)
    
    wildcard_constraints:
        # Constraints maybe not needed but in my opinion better to set them
        sample='|'.join([re.escape(x) for x in samples['sample']]),
        unit_bio='|'.join([re.escape(x) for x in samples.unit_bio]),
        unit_tech='|'.join([re.escape(x) for x in samples.unit_tech]),
    
    rule all:
        input:
            # Assume for now you just want to produce these bam files,
            # one per row of the sample sheet:
            expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
                sample=samples['sample'], unit_bio=samples.unit_bio, unit_tech=samples.unit_tech),
    
    
    rule r1_mapping:
        input:
            read1 = lambda wc: 
                samples[(samples['sample'] == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
            # Other input/params that don't need wildcards...
        output:
            "scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", 
    

    The main point is to use an input function (a lambda here) to query the sample sheet and get the fastq file corresponding to a given sample, unit_bio, and unit_tech. The other key change is passing zip to expand() in rule all, so targets are paired row-wise instead of combined as a product - see the sketch below.

    If this doesn't work or you can't make sense of it, post a fully reproducible example that we can easily replicate.
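
    To make the zip in expand() concrete: by default expand() builds the cross product of all value lists, which is why your earlier rule all produced every row several times over; passing zip pairs the lists element-wise, giving exactly one target per row of the sample sheet. A toy example with values from your sheet:

    from snakemake.io import expand

    sample    = ['bCalAnn1', 'bCalAnn1', 'bCalAnn2']
    unit_bio  = ['1', '1', '2']
    unit_tech = ['1', '2', '1']

    # Default combinator: cross product -> 3 * 3 * 3 = 27 combinations
    # (with duplicates), most of which match no row of the sample sheet.
    expand("{s}_{b}_{t}_R1.bam", s=sample, b=unit_bio, t=unit_tech)

    # With zip: element-wise pairing -> one target per row.
    expand("{s}_{b}_{t}_R1.bam", zip, s=sample, b=unit_bio, t=unit_tech)
    # ['bCalAnn1_1_1_R1.bam', 'bCalAnn1_1_2_R1.bam', 'bCalAnn2_2_1_R1.bam']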