python pandas bioinformatics pipeline snakemake

How to use list in Snakemake Tabular configuration, for describing of sequencing units for bioinformatic pipeline

How to use a list in Snakemake tabular config.

I use Snakemake Tabular (mapping with BWA mem) configuration to describe my sequencing units (libraries sequenced on separate lines). At the next stage of analysis I have to merge sequencing units (mapped .bed files) and take merged .bam files (one for each sample). Now I'm using YAML config for describing of what units belong to what samples. But I wish to use Tabular config for this purpose,

I'm not clear how to write and recall a list (containing sample information) from cell of Tab separated file.

This is how my Tablular config for units looks like:

Unit    SampleSM    LineID  PlatformPL  LibraryLB   RawFileR1   RawFileR2
sample_001.lane_L1  sample_001  lane_L1 ILLUMINA    sample_001  /user/data/sample_001.lane_L1.R1.fastq.gz   /user/data/sample_001.lane_L1.R2.fastq.gz
sample_001.lane_L2  sample_001  lane_L2 ILLUMINA    sample_001  /user/data/sample_001.lane_L2.R1.fastq.gz   /user/data/sample_001.lane_L2.R2.fastq.gz
sample_001.lane_L8  sample_001  lane_L8 ILLUMINA    sample_001  /user/data/sample_001.lane_L8.R1.fastq.gz   /user/data/sample_001.lane_L8.R2.fastq.gz
sample_002.lane_L1  sample_002  lane_L1 ILLUMINA    sample_002  /user/data/sample_002.lane_L1.R1.fastq.gz   /user/data/sample_002.lane_L1.R2.fastq.gz
sample_002.lane_L2  sample_002  lane_L2 ILLUMINA    sample_002  /user/data/sample_002.lane_L2.R1.fastq.gz   /user/data/sample_002.lane_L2.R2.fastq.gz

This is how my YAML config for Samples looks like:

samples:
 "sample_001": ["sample_001.lane_L1", "sample_001.lane_L2", "sample_001.lane_L8"]
 "sample_002": ["sample_002.lane_L1", "sample_002.lane_L2"]

My Snakemake code:

import pandas as pd
import os

workdir: "/user/data/snakemake/"

configfile: "Samples.yaml"

units_table = pd.read_table("Units.tsv").set_index("Unit", drop=False)

rule all:
    input:
        expand('map_folder/{unit}.bam', unit=units_table.Unit),
        expand('merge_bam_folder/{sample}.bam', sample=config["samples"]),

rule map_paired_end:
    input:
        r1 = lambda wildcards: expand(units_table.RawFileR1[wildcards.unit]),
        r2 = lambda wildcards: expand(units_table.RawFileR2[wildcards.unit])
    output:
        bam = 'map_folder/{unit}.bam'
    params: 
        bai = 'map_folder/{unit}.bam.bai',
        ref='/user/data/human_g1k_v37.fasta.gz',
        SampleSM = lambda wildcards: units_table.SampleSM[wildcards.unit],
        LineID = lambda wildcards: units_table.LineID[wildcards.unit],
        PlatformPL = lambda wildcards: units_table.PlatformPL[wildcards.unit],
        LibraryLB = lambda wildcards: units_table.LibraryLB[wildcards.unit]
    threads:
        16  
    shell:
            r"""
                    seqtk mergepe {input.r1} {input.r2}\
                    | bwa mem -M -t {threads} -v 3 \
                    {params.ref} - \
                    -R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
                    | samtools view -u -Sb - \
                    | samtools sort - -m 4G -o {output.bam} 

                    samtools index {output.bam}
                    """

rule samtools_merge_bam:
    input:  
        lambda wildcards: expand('map_folder/{file}.bam', file=config['samples'][wildcards.sample])
    output:
        bam = 'merge_bam_folder/{sample}.bam'
    threads:
        1
    shell:  
                    r"""
                    samtools merge {output.bam} {input}

                    samtools index {output.bam}
                    """

Solution

What about this below?

I have excluded the Samples.yaml as I think it is not necessary given your sample sheet.

In rule samtools_merge_bam you collect all unit-bam files sharing the same SampleSM. These unit-bam files are created in map_paired_end where the lambda expression collects the fastq files for each unit.

Note also that I have removed the unit-bam files from the all rule as (I think) these are just intermediate files and they could be marked as temporary using the temp() flag.

import pandas as pd
import os

workdir: "/output/dir" 

units_table = pd.read_table("Units.tsv")
samples= list(units_table.SampleSM.unique())

rule all:
    input:
        expand('merge_bam_folder/{sample}.bam', sample= samples),

rule map_paired_end:
    input:
        r1 = lambda wildcards: units_table.RawFileR1[units_table.Unit == wildcards.unit],
        r2 = lambda wildcards: units_table.RawFileR2[units_table.Unit == wildcards.unit],
    output:
        bam = 'map_folder/{unit}.bam'
    params: 
        bai = 'map_folder/{unit}.bam.bai',
        ref='/user/data/human_g1k_v37.fasta.gz',
        SampleSM = lambda wildcards: list(units_table.SampleSM[units_table.Unit == wildcards.unit]),
        LineID = lambda wildcards: list(units_table.LineID[units_table.Unit == wildcards.unit]),
        PlatformPL = lambda wildcards: list(units_table.PlatformPL[units_table.Unit == wildcards.unit]),
        LibraryLB = lambda wildcards: list(units_table.LibraryLB[units_table.Unit == wildcards.unit]),
    threads:
        16  
    shell:
        r"""
        seqtk mergepe {input.r1} {input.r2}\
        | bwa mem -M -t {threads} -v 3 \
        {params.ref} - \
        -R "@RG\tID:{params.LineID}\tSM:{params.SampleSM}\tPL:{params.PlatformPL}\tLB:{params.LibraryLB}"\
        | samtools view -u -Sb - \
        | samtools sort - -m 4G -o {output.bam} 

        samtools index {output.bam}
        """

rule samtools_merge_bam:
    input:  
        lambda wildcards: expand('map_folder/{unit}.bam',
            unit= units_table.Unit[units_table.SampleSM == wildcards.sample])
    output:
        bam = 'merge_bam_folder/{sample}.bam'
    threads:
        1
    shell:  
        r"""
        samtools merge {output.bam} {input}

        samtools index {output.bam}
        """