Error:
Building DAG of jobs...
WildcardError in line 502 of /path/to/pipeline/workflow/Snakefile.py:
Wildcards in input files cannot be determined from output files:
'anc_r'
Code:
import os
import json
from datetime import datetime
from glob import iglob

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Define Constants ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# discover input files using path from run config
SAMPLES = list(set(glob_wildcards(f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz").sample))
# read output dir path from run config
OUTPUT_DIR = f"{config['output_dir']}"
# Project name and date for bam header
SEQID = 'pipeline_align'
config['anc_r'] = list(set(glob_wildcards(f"{config['anc_dir']}/{{anc_r}}_R1_001.fastq.gz").anc_r))

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Pipeline ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
# https://snakemake.readthedocs.io/en/v7.14.0/tutorial/basics.html#step-7-adding-a-target-rule
rule all:
    input:
        f'{OUTPUT_DIR}/DONE.txt'

# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Set Up Reference Files ~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
# export the current run configuration in JSON format
#
rule export_run_config:
    output:
        path=f"{OUTPUT_DIR}/00_logs/00_run_config.json"
    run:
        with open(output.path, 'w') as outfile:
            json.dump(dict(config), outfile, indent=4)

#
# make a list of discovered samples
#
rule list_samples:
    output:
        f"{OUTPUT_DIR}/00_logs/00_sample_list.txt"
    shell:
        "echo -e '{}' > {{output}}".format('\n'.join(SAMPLES))

#
# copy the supplied reference genome fasta to the pipeline output directory for reference
#
rule copy_fasta:
    input:
        config['ref_fasta']
    output:
        f"{OUTPUT_DIR}/01_ref_files/{os.path.basename(config['ref_fasta'])}"
    shell:
        "cp {input} {output}"

rule index_fasta:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.fai"
    conda:
        'envs/main.yml'
    shell:
        "samtools faidx {input}"

rule create_ref_dict:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}".rstrip('fasta') + 'dict'
    conda:
        'envs/main.yml'
    shell:
        "picard CreateSequenceDictionary -R {input}"

#
# create a BWA index from the copied fasta reference genome
#
rule create_bwa_index:
    input:
        rules.copy_fasta.output
    output:
        f"{rules.copy_fasta.output}.amb",
        f"{rules.copy_fasta.output}.ann",
        f"{rules.copy_fasta.output}.bwt",
        f"{rules.copy_fasta.output}.pac",
        f"{rules.copy_fasta.output}.sa",
    conda:
        'envs/main.yml'
    shell:
        "bwa index {input}"

#
# Get the ancestor BAM
#
rule align_reads_anc:
    input:
        rules.create_bwa_index.output,
        R1=lambda wildcards: f"{config['anc_r']}/{anc_r}_R1_001.fastq.gz",
        R2=lambda wildcards: f"{config['anc_r']}/{anc_r}_R2_001.fastq.gz",
    output:
        bam=f"{OUTPUT_DIR}/03_init_alignment/{{anc_r}}/{{anc_r}}_R1R2_sort.bam",
    conda:
        'envs/main.yml'
    shell:
        r"""bwa mem -R '@RG\tID:""" + SEQID + r"""\tSM:""" + '{wildcards.anc_r}' + r"""\tLB:1'""" + ' {rules.copy_fasta.output} {input.R1} {input.R2} | samtools sort -o {output.bam} - && samtools index {output.bam}'

rule samtools_index_one_anc:
    input:
        rules.align_reads_anc.output
    output:
        f"{OUTPUT_DIR}/03_init_alignment/{{anc_r}}/{{anc_r}}_R1R2_sort.bam.bai"
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

rule samtools_flagstat_anc:
    input:
        bam=rules.align_reads_anc.output,
        idx=rules.samtools_index_one_anc.output
    output:
        f"{OUTPUT_DIR}/03_init_alignment/{{anc_r}}/{{anc_r}}_R1R2_sort_flagstat.txt"
    conda:
        'envs/main.yml'
    shell:
        "samtools flagstat {input.bam} > {output}"

rule picard_mark_dupes_anc:
    input:
        bam=rules.align_reads_anc.output,
        idx=rules.samtools_index_one_anc.output,
        dct=rules.create_ref_dict.output
    output:
        bam=f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.MD.bam",
        metrics=f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.sort_dup_metrix"
    conda:
        'envs/main.yml'
    shell:
        "picard MarkDuplicates --INPUT {input.bam} --OUTPUT {output.bam} --METRICS_FILE {output.metrics} --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY LENIENT"

rule picard_read_groups_anc:
    input:
        rules.picard_mark_dupes_anc.output.bam
    output:
        f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.sort.bam",
    conda:
        'envs/main.yml'
    shell:
        f"picard AddOrReplaceReadGroups --INPUT {{input}} --OUTPUT /dev/stdout --RGID {SEQID} --RGLB 1 --RGPU 1 --RGPL illumina --RGSM {{wildcards.anc_r}} --VALIDATION_STRINGENCY LENIENT | samtools sort -o {{output}} -"

rule samtools_index_two_anc:
    input:
        rules.picard_read_groups_anc.output
    output:
        f"{OUTPUT_DIR}/04_picard/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.sort.bam.bai",
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

rule gatk_register:
    input:
        f'workflow/envs/src/GenomeAnalysisTK-3.7-0-gcfedb67.tar.bz2'
    output:
        f'{OUTPUT_DIR}/05_gatk/gatk_3.7_registered.txt'
    conda:
        'envs/main.yml'
    shell:
        "gatk-register {input} > {output}"

rule gatk_realign_targets_anc:
    input:
        fa=rules.copy_fasta.output,
        bam=rules.picard_read_groups_anc.output,
        idx=rules.samtools_index_two_anc.output,
        gatk=rules.gatk_register.output,
        faidx=rules.index_fasta.output
    output:
        f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.bam.intervals'
    conda:
        'envs/main.yml'
    shell:
        "GenomeAnalysisTK -T RealignerTargetCreator -R {input.fa} -I {input.bam} -o {output}"

rule gatk_realign_indels_anc:
    input:
        fa=rules.copy_fasta.output,
        intervals=rules.gatk_realign_targets_anc.output,
        bam=rules.picard_read_groups_anc.output,
        idx=rules.samtools_index_two_anc.output
    output:
        bam=f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.bam',
        bai=f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.bai'
    conda:
        'envs/main.yml'
    shell:
        "GenomeAnalysisTK -T IndelRealigner -R {input.fa} -I {input.bam} -targetIntervals {input.intervals} -o {output.bam}"

rule samtools_sort_three_anc:
    input:
        rules.gatk_realign_indels_anc.output.bam
    output:
        f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.sort.bam',
    conda:
        'envs/main.yml'
    shell:
        "samtools sort {input} -o {output}"

rule samtools_index_three_anc:
    input:
        rules.samtools_sort_three_anc.output
    output:
        f'{OUTPUT_DIR}/05_gatk/{{anc_r}}/{{anc_r}}_comb_R1R2.RG.MD.realign.sort.bam.bai',
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

rule index_ancestor_bam:
    input:
        rules.samtools_sort_three_anc.output
    output:
        f"{rules.samtools_sort_three_anc.output}.bai"
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

rule bcftools_pileup_anc:
    input:
        bam=rules.samtools_sort_three_anc.output,
        idx=rules.samtools_index_three_anc.output
    output:
        f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_samtools_AB.vcf',
    conda:
        'envs/main.yml'
    shell:
        "bcftools mpileup --ignore-RG -Ou -ABf {rules.copy_fasta.output} {input.bam} | bcftools call -vmO v -o {output}"

rule freebayes_anc:
    input:
        bam=rules.samtools_sort_three_anc.output,
        idx=rules.samtools_index_three_anc.output
    output:
        f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_freebayes_BCBio.vcf',
    conda:
        'envs/main.yml'
    shell:
        "freebayes -f {rules.copy_fasta.output} --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off --min-alternate-fraction 0.1 {input.bam} > {output}"

rule lofreq_anc:
    input:
        bam=rules.samtools_sort_three_anc.output,
        idx=rules.samtools_index_three_anc.output,
        ancidx=rules.index_ancestor_bam.output,
    output:
        normal=f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_lofreq_normal_relaxed.vcf.gz',
        tumor=f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_lofreq_tumor_relaxed.vcf.gz',
        somatic=f'{OUTPUT_DIR}/06_variant_calling/{{anc_r}}/{{anc_r}}_lofreq_somatic_final.snvs.vcf.gz',
    conda:
        'envs/main.yml'
    shell:
        f"lofreq somatic -n {{input.bam}} -t {{input.bam}} -f {{rules.copy_fasta.output}} -o {OUTPUT_DIR}/06_variant_calling/{{wildcards.anc_r}}/{{wildcards.anc_r}}_lofreq_"

rule unzip_lofreq_anc:
    input:
        normal=rules.lofreq_anc.output.normal,
        tumor=rules.lofreq_anc.output.tumor,
        somatic=rules.lofreq_anc.output.somatic,
    output:
        normal=rules.lofreq_anc.output.normal.replace('.vcf.gz', '.vcf'),
        tumor=rules.lofreq_anc.output.tumor.replace('.vcf.gz', '.vcf'),
        somatic=rules.lofreq_anc.output.somatic.replace('.vcf.gz', '.vcf'),
    conda:
        'envs/main.yml'
    shell:
        "bgzip -d {input.normal} && bgzip -d {input.tumor} && bgzip -d {input.somatic}"

# ~~~~~~~~~~~~~~~~~~~~~~~~ Perform Initial Alignment ~~~~~~~~~~~~~~~~~~~~~~~~ #
rule align_reads:
    input:
        rules.create_bwa_index.output,
        r1=f"{config['fastq_dir']}/{{sample}}_R1_001.fastq.gz",
        r2=f"{config['fastq_dir']}/{{sample}}_R2_001.fastq.gz",
    output:
        f"{OUTPUT_DIR}/03_init_alignment/{{sample}}/{{sample}}_R1R2_sort.bam"
    conda:
        'envs/main.yml'
    shell:
        r"""bwa mem -R '@RG\tID:""" + SEQID + r"""\tSM:""" + '{wildcards.sample}' + r"""\tLB:1'""" + ' {rules.copy_fasta.output} {input.r1} {input.r2} | samtools sort -o {output} -'

rule samtools_index_one:
    input:
        rules.align_reads.output
    output:
        f"{OUTPUT_DIR}/03_init_alignment/{{sample}}/{{sample}}_R1R2_sort.bam.bai"
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

rule samtools_flagstat:
    input:
        bam=rules.align_reads.output,
        idx=rules.samtools_index_one.output
    output:
        f"{OUTPUT_DIR}/03_init_alignment/{{sample}}/{{sample}}_R1R2_sort_flagstat.txt"
    conda:
        'envs/main.yml'
    shell:
        "samtools flagstat {input.bam} > {output}"

rule picard_mark_dupes:
    input:
        bam=rules.align_reads.output,
        idx=rules.samtools_index_one.output,
        dct=rules.create_ref_dict.output
    output:
        bam=f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.MD.bam",
        metrics=f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.sort_dup_metrix"
    conda:
        'envs/main.yml'
    shell:
        "picard MarkDuplicates --INPUT {input.bam} --OUTPUT {output.bam} --METRICS_FILE {output.metrics} --REMOVE_DUPLICATES true --VALIDATION_STRINGENCY LENIENT"

rule picard_read_groups:
    input:
        rules.picard_mark_dupes.output.bam
    output:
        f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.RG.MD.sort.bam",
    conda:
        'envs/main.yml'
    shell:
        f"picard AddOrReplaceReadGroups --INPUT {{input}} --OUTPUT /dev/stdout --RGID {SEQID} --RGLB 1 --RGPU 1 --RGPL illumina --RGSM {{wildcards.sample}} --VALIDATION_STRINGENCY LENIENT | samtools sort -o {{output}} -"

rule samtools_index_two:
    input:
        rules.picard_read_groups.output
    output:
        f"{OUTPUT_DIR}/04_picard/{{sample}}/{{sample}}_comb_R1R2.RG.MD.sort.bam.bai",
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ GATK Re-alignment ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
# configure gatk-3.7 inside conda environment
#
rule gatk_realign_targets:
    input:
        fa=rules.copy_fasta.output,
        bam=rules.picard_read_groups.output,
        idx=rules.samtools_index_two.output,
        gatk=rules.gatk_register.output,
        faidx=rules.index_fasta.output
    output:
        f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.bam.intervals'
    conda:
        'envs/main.yml'
    shell:
        "GenomeAnalysisTK -T RealignerTargetCreator -R {input.fa} -I {input.bam} -o {output}"

rule gatk_realign_indels:
    input:
        fa=rules.copy_fasta.output,
        intervals=rules.gatk_realign_targets.output,
        bam=rules.picard_read_groups.output,
        idx=rules.samtools_index_two.output
    output:
        bam=f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.bam',
        bai=f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.bai'
    conda:
        'envs/main.yml'
    shell:
        "GenomeAnalysisTK -T IndelRealigner -R {input.fa} -I {input.bam} -targetIntervals {input.intervals} -o {output.bam}"

rule samtools_sort_three:
    input:
        rules.gatk_realign_indels.output.bam
    output:
        f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.sort.bam',
    conda:
        'envs/main.yml'
    shell:
        "samtools sort {input} -o {output}"

rule samtools_index_three:
    input:
        rules.samtools_sort_three.output
    output:
        f'{OUTPUT_DIR}/05_gatk/{{sample}}/{{sample}}_comb_R1R2.RG.MD.realign.sort.bam.bai',
    conda:
        'envs/main.yml'
    shell:
        "samtools index {input}"

# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Variant Calling ~~~~~~~~~~~~~~~~~~~~~~~~~~ #
rule bcftools_pileup:
    input:
        bam=rules.samtools_sort_three.output,
        idx=rules.samtools_index_three.output
    output:
        f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_samtools_AB.vcf',
    conda:
        'envs/main.yml'
    shell:
        "bcftools mpileup --ignore-RG -Ou -ABf {rules.copy_fasta.output} {input.bam} | bcftools call -vmO v -o {output}"

rule freebayes:
    input:
        bam=rules.samtools_sort_three.output,
        idx=rules.samtools_index_three.output
    output:
        f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_freebayes_BCBio.vcf',
    conda:
        'envs/main.yml'
    shell:
        "freebayes -f {rules.copy_fasta.output} --pooled-discrete --pooled-continuous --report-genotype-likelihood-max --allele-balance-priors-off --min-alternate-fraction 0.1 {input.bam} > {output}"

rule lofreq:
    input:
        bam=rules.samtools_sort_three.output,
        idx=rules.samtools_index_three.output,
        ancidx=rules.index_ancestor_bam.output,
    output:
        normal=f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_lofreq_normal_relaxed.vcf.gz',
        tumor=f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_lofreq_tumor_relaxed.vcf.gz',
        somatic=f'{OUTPUT_DIR}/06_variant_calling/{{sample}}/{{sample}}_lofreq_somatic_final.vcf.gz',
    conda:
        'envs/main.yml'
    shell:
        f"lofreq somatic -n {{rules.copy_ancestor_bam.output}} -t {{input.bam}} -f {{rules.copy_fasta.output}} -o {{output.somatic}} && "
        f"bgzip {{output.normal}} && bgzip {{output.tumor}} && bgzip {{output.somatic}}"

rule unzip_lofreq:
    input:
        normal=rules.lofreq.output.normal,
        tumor=rules.lofreq.output.tumor,
        somatic=rules.lofreq.output.somatic,
    output:
        normal=rules.lofreq.output.normal.replace('.vcf.gz', '.vcf'),
        tumor=rules.lofreq.output.tumor.replace('.vcf.gz', '.vcf'),
        somatic=rules.lofreq.output.somatic.replace('.vcf.gz', '.vcf'),
    conda:
        'envs/main.yml'
    shell:
        "bgzip -d {input.normal} && bgzip -d {input.tumor} && bgzip -d {input.somatic}"

# ~~~~~~~~~~~~~~~~~~~~~~~~~~ Begin Filtering Steps ~~~~~~~~~~~~~~~~~~~~~~~~~~ #
#
# filter samtools results by ancestor, took normal from the output
#
rule anc_filter_samtools:
    input:
        sample=rules.bcftools_pileup.output,
        anc=rules.unzip_lofreq_anc.output.normal
    output:
        f'{OUTPUT_DIR}/07_filtered/{{sample}}/{{sample}}_samtools_AB_AncFiltered.vcf',
    conda:
        'envs/main.yml'
    shell:
        f"bedtools intersect -v -header -a {{input.sample}} -b {{input.anc}} > {{output}}"
Tried fixing wildcards in previous rules. Same error. Tried altering the config dict. Same error. Tried altering previous rules. Same error. Tried altering the config file. Same error. Tried looking at the documentation. Don't understand what I'm doing wrong. Thought it was the lofreq rule that was the problem. Same error.
P.S. Broken keyboard.
Sorry for your troubles, and yes, this is a long example, which is probably why no one has helped.
For the future, you can use the --debug-dag option to see the paths and wildcards snakemake uses when building the DAG. I frequently find bugs/errors with that option.
I also want to point out that instead of using OUTPUT_DIR in all your outputs, you can set a workdir and make your other files relative to the working directory.
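A minimal sketch of that approach, using your own export_run_config rule (the workdir directive is standard Snakemake; everything else is copied from your file with the prefix dropped):

```python
# Set the working directory once from the run config; all relative
# paths in rules below are then resolved against it.
workdir: config['output_dir']

rule export_run_config:
    output:
        # relative to workdir -- no OUTPUT_DIR prefix needed
        path="00_logs/00_run_config.json"
    run:
        with open(output.path, 'w') as outfile:
            json.dump(dict(config), outfile, indent=4)
```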
For your specific error, snakemake does an ok job telling you what is wrong:
WildcardError in line 502 of /path/to/pipeline/workflow/Snakefile.py:
Line 502 is the anc_filter_samtools rule. As outputs you have a vcf with sample as the only wildcard, and snakemake complains:
Wildcards in input files cannot be determined from output files:
'anc_r'
which is a wildcard used in the anc input file. So you have a wildcard in an input file (anc) without a matching wildcard in the output file.
You either need to include anc_r in the output file name or create an input function to convert the sample to the appropriate anc_r.
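Here is a minimal sketch of the input-function route. The sample names, ancestor names, and the SAMPLE_TO_ANC mapping are all hypothetical; in your Snakefile you would build the mapping from your config and return the real unzip_lofreq_anc output path.

```python
# Hypothetical mapping from each evolved sample to its ancestor.
# Build this from your own config (e.g. by pairing glob_wildcards()
# results); the names here are placeholders.
SAMPLE_TO_ANC = {"evolved_01": "anc_A", "evolved_02": "anc_B"}

def anc_normal_vcf(wildcards):
    """Input function: resolve the ancestor VCF path for a sample.

    Snakemake calls this with wildcards already resolved from the
    OUTPUT file, so the rule's output only needs {sample}.
    """
    anc = SAMPLE_TO_ANC[wildcards.sample]
    return f"output/06_variant_calling/{anc}/{anc}_lofreq_normal_relaxed.vcf"

# The rule would then look like:
# rule anc_filter_samtools:
#     input:
#         sample=rules.bcftools_pileup.output,
#         anc=anc_normal_vcf          # no anc_r wildcard in the output needed
#     output:
#         "07_filtered/{sample}/{sample}_samtools_AB_AncFiltered.vcf"
```

Because the function receives the resolved wildcards object, the anc_r wildcard never has to appear in the output path at all.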
When you work through rules, remember that snakemake starts by resolving all wildcards in the output file and then fills in the other directives, like inputs.