Search code examples
pythonbioinformaticssnakemake

Snakemake - set final output filename from configuration


I'm working on a Snakemake pipeline to extract a list of variants from individual VCF files, then merge them into a single, aggregated (i.e., multi-sample) VCF file.

Here is a toy implementation:

config.yml:

---
variants_file: "input/variants_all_grch38.tsv"
wgs_file: "input/wgs_gvcf_files_summary_data.tsv"

Snakefile:

configfile: "config.yml"

import pandas as pd

# Returns concatenated single-string output of all variants from input file, e.g., loci = "1:12345,2:34567,3:45678"
def get_loci(variants_tsv: str) -> str:
    variants = pd.read_table(variants_tsv, sep="\t")
    variants = variants.sort_values(["chr", "pos"], ascending=True)[["chr", "pos"]]
    loci = ",".join(variants.astype(str).agg(":".join, axis=1))
    return loci

# Returns multiple lists containing "parts" of equal length, to construct complete file path to VCFs
# e.g.
# part1 = ["p1_v1", "p1_v2", "p1_v3", ...]
# part2 = ["p2_v1", "p2_v2", "p2_v3", ...]
# part3 = ["p3_v1", "p3_v2", "p3_v3", ...]
def get_wgs_parts(wgs_gvcf_paths_data_tsv: str) -> tuple[list[str], list[str], list[str]]:
    wgs_parts_cols = [col1, col2, col3]
    wgs_gvcf_paths_data = pd.read_table(wgs_gvcf_paths_data_tsv, sep="\t")
    wgs_gvcf_paths_data = wgs_gvcf_paths_data.loc[(col4 == "X") & (col5 == "Y")][wgs_parts_cols]
    part1, part2, part3 = tuple(wgs_gvcf_paths_data.T.values.tolist())
    return part1, part2, part3

LOCI = get_loci(config["variants_file"])
PART1, PART2, PART3 = get_wgs_parts(config["wgs_file"])

rule extract_loci:
    input:
        vcf="/path/to/wgs/data/{part1}/{part2}/{part3}.vcf.gz",
        idx="/path/to/wgs/data/{part1}/{part2}/{part3}.vcf.gz.tbi"
    output:
        vcf=temp("output/extracted/{part1}/{part2}/{part3}.vcf.gz"),
        idx=temp("output/extracted/{part1}/{part2}/{part3}.vcf.gz.tbi")
    params:
        loci=LOCI
    shell:
        "bcftools filter -r '{params.loci}' {input.vcf} -Oz -o {output.vcf} && "
        "tabix -p vcf {output.vcf}"

rule merge_vcfs:
    input:
        vcf=expand("output/extracted/{part1}/{part2}/{part3}.vcf.gz", zip, part1=PART1, part2=PART2, part3=PART3),
        idx=expand("output/extracted/{part1}/{part2}/{part3}.vcf.gz.tbi", zip, part1=PART1, part2=PART2, part3=PART3)
    output:
        vcf="output/merged/all_variants.vcf.gz",
        idx="output/merged/all_variants.vcf.gz.tbi"
    shell:
        "bcftools merge {input.vcf} | bcftools sort -T /tmp/ -Oz -o {output.vcf} && "
        "tabix -p vcf {output.vcf}"

# I've deliberately avoided using glob_wildcards()
rule all:
    input:
        vcf=expand("/path/to/wgs/data/{part1}/{part2}/{part3}.vcf.gz", zip, part1=PART1, part2=PART2, part3=PART3),
        idx=expand("/path/to/wgs/data/{part1}/{part2}/{part3}.vcf.gz.tbi", zip, part1=PART1, part2=PART2, part3=PART3)
    output:
        vcf="output/merged/all_variants.vcf.gz",
        idx="output/merged/all_variants.vcf.gz.tbi"

I run this pipeline using snakemake --cores 4 "output/merged/all_variants.vcf.gz", & it works correctly.

I'd like to maintain separate aggregated VCFs based on changes I make to the input files (I might change the list of variants to extract, or use a different list of individual VCFs), possibly by having Snakemake use the config file to set the final filename.

I tried adding a new variable in config.yml to control this:

config.yml

variants_file: "input/variants_list1_grch38.tsv"
wgs_file: "input/wgs_gvcf_files_summary_data.tsv"
output_filename: "list1_grch38"

Snakefile

<<truncated>>

rule merge_vcfs:
    input:
        <<truncated>>
    output:
        vcf="output/merged/{output_filename}.vcf.gz",
        idx="output/merged/{output_filename}.vcf.gz.tbi"
    params:
        output_filename=config["output_filename"]
    shell:
        <<truncated>>

rule all:
    input:
        <<truncated>>
    output:
        vcf="output/merged/{output_filename}.vcf.gz",
        idx="output/merged/{output_filename}.vcf.gz.tbi"
    params:
        output_filename=config["output_filename"]

I was expecting that after multiple runs (with different input files) of snakemake --cores 4, I'd have something like this:

$ find $(pwd)
config.yml
input/
input/variants_list1_grch38.tsv
input/variants_list2_grch38.tsv
input/variants_list3_grch37.tsv
input/wgs_gvcf_files_summary_data.tsv
output/
output/merged/
output/merged/list1_grch38.vcf.gz
output/merged/list1_grch38.vcf.gz.tbi
output/merged/list2_grch38.vcf.gz
output/merged/list2_grch38.vcf.gz.tbi
output/merged/list3_grch37.vcf.gz
output/merged/list3_grch37.vcf.gz.tbi
Snakefile

However, the dry-run throws this error:

WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards at the command line, or have a rule without wildcards at the very top of your workflow (e.g. the typical "rule all" which just collects all results you want to generate in the end).

Since the final output filename "all_variants" is hardcoded, I can't differentiate between multiple runs of the pipeline.

I'm essentially looking for a way to tell Snakemake to take the final output filename from config.yml (or some other configurable means), & allow me to simply run snakemake --cores 4.

Any help would be greatly appreciated!


Solution

  • You can simply use entries from your configfile as output for your rule all:

    configfile: "config.yml"
    
    rule all:
        input:
            <<truncated>>
        output:
            vcf=expand("output/merged/{output_filename}.vcf.gz", output_filename=config["output_filename"]),
            idx=expand("output/merged/{output_filename}.vcf.gz.tbi", output_filename=config["output_filename"])
        params:
            output_filename=config["output_filename"]
    

    The reason your code is currently not working is: You don't expand the wildcard for output. Simply providing it to the rule as param is not enough