Search code examples
pythonsnakemake

Snakemake: MissingInputException - Missing input files for rule all


I am trying to make a snakemake workflow for whatshap haplotype caller but I am struggling with MissingInputException errors. This is what I get:

Building DAG of jobs...
MissingInputException in line 9 of /srv/KLN/users/esv/KRISPS/whatshap/phased_illumina_FILT5/snakefile:
Missing input files for rule all:
saturna/saturna_phased_illumina_FILT5.vcf.gz
stratos/stratos_phased_illumina_FILT5.vcf.gz
ydun/ydun_phased_illumina_FILT5.vcf.gz
12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz
verdi/verdi_phased_illumina_FILT5.vcf.gz
wotan/wotan_phased_illumina_FILT5.vcf.gz
avarna/avarna_phased_illumina_FILT5.vcf.gz
avenue/avenue_phased_illumina_FILT5.vcf.gz
seresta/seresta_phased_illumina_FILT5.vcf.gz
15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz

If I remove "rule all" and try to produce a single file I get this error:

Building DAG of jobs...
MissingRuleException:
No rule to produce 12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz (if you use input functions make sure that they don't raise unexpected exceptions).

What am I missing? I am new to snakemake so maybe (hopefully) it's just a basic mistake. Here is my code:

shell.prefix("module load whatshap/1.1-foss-2020b-Python-3.8.6; module load BCFtools/1.11-GCC-10.2.0; ")

reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
chroms= ['chr01','chr02','chr03','chr04','chr05','chr06','chr07','chr08','chr09','chr10','chr11','chr12']

rule all:
    input:  
        "12NAE3/12NAE3_phased_illumina_FILT5.vcf.gz",
        "15NOH7/15NOH7_phased_illumina_FILT5.vcf.gz",
        "avarna/avarna_phased_illumina_FILT5.vcf.gz",
        "avenue/avenue_phased_illumina_FILT5.vcf.gz",
        "Kuras/Kuras_phased_illumina_FILT5.vcf.gz",
        "saturna/saturna_phased_illumina_FILT5.vcf.gz",
        "seresta/seresta_phased_illumina_FILT5.vcf.gz",
        "stratos/stratos_phased_illumina_FILT5.vcf.gz",
        "verdi/verdi_phased_illumina_FILT5.vcf.gz",
        "wotan/wotan_phased_illumina_FILT5.vcf.gz",
        "ydun/ydun_phased_illumina_FILT5.vcf.gz"


rule HaplotypeCalling:
    input:
        reference = reference,
        vcf = "/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf",
        bam = "/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam"
    output:
        "temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    params:
        chroms = chroms
    shell:
        "whatshap polyphase --ploidy 4 -o {output} --reference {input.reference} {input.vcf} {input.bam} --sample {wildcards.sample} --chromosome {params.chroms}"


rule SplitVCF:
    input:
        "temp/{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf"
    shell:
        "bcftools view -s {wildcards.sample} -o {output} {input}"
        

rule ConcatVCF:
    input:
        expand("{{sample}}/{sample}_phased_illumina_FILT5_{chrom}.vcf", sample=samples, chrom=chroms)
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"
    shell:
        "bcftools concat {input} -o {output}"


rule GZipVCF:
    input:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf"    
    output:
        "{{sample}}/{sample}_phased_illumina_FILT5.vcf.gz"
    shell:
        "bgzip -c {input} > {output}"

Edit: the commands I expect for each sample are these, assuming that I only have two chromosomes (chr01 and chr02) (the example is for sample ydun):

#rule HaplotypeCalling
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr01.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr01
whatshap polyphase --ploidy 4 -o temp/ydun/ydun_phased_illumina_FILT5_chr02.vcf --reference /srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa /srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_chr02.vcf /srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/ydun.bam --sample ydun --chromosome chr02

#rule SplitVCF
bcftools view -s ydun -o ydun/ydun_phased_illumina_FILT5_chr01.vcf temp/ydun/ydun_phased_illumina_FILT5_chr01.vcf

#rule ConcatVCF
bcftools concat ydun/ydun_phased_illumina_FILT5_chr01.vcf ydun/ydun_phased_illumina_FILT5_chr02.vcf -o ydun/ydun_phased_illumina_FILT5.vcf

#rule GZipVCF
bgzip -c ydun/ydun_phased_illumina_FILT5.vcf > ydun/ydun_phased_illumina_FILT5.vcf.gz

Solution

  • Based on your expected commands, here is my attempt at refactoring your workflow. I've commented to explain the changes.

    # moved shell prefix to envmodules directive
    # https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html#using-environment-modules
    # that's better for if you need a different set of modules later
    
    reference = "/srv/KLN/users/esv/Reference/DM_v6/DM_1-3_516_R44_potato_genome_assembly.v6.1.fa"
    samples= ['12NAE3', '15NOH7', 'avarna', 'avenue', 'Kuras', 'saturna', 'seresta', 'stratos', 'verdi', 'wotan', 'ydun']
    chroms= [f'chr{chrom:02}' for chrom in range(1, 13)]
    
    # place filenames together or (better) in a config.yaml
    phased_output = '{sample}/{sample}_phased_illumina_FILT5.vcf.gz'
    variants = '/srv/KLN/users/esv/KRISPS/snakemake/2110_variantcalling/results/variants/filtered/FILT5/variants_FILT5_{chrom}.vcf'
    sample_bam = '/srv/KLN/users/esv/KRISPS/Data/Illumina/DM_v6/BAM/{sample}.bam'
    temp_vcf = 'temp/{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
    split_vcf = '{sample}/{sample}_phased_illumina_FILT5_{chrom}.vcf'
    
    
    # don't repeat yourself with sample names or phased output
    # using expand makes it easier to add new samples later or change your
    # filename
    rule all:
        input:  
            expand(phased_output, sample=samples)
    
    # using variables for filenames makes rules clearer
    # Based on your sample, I think you should pass in wildcards.chrom
    # instead of the chroms list
    # For shell, I like to format the calls with an option per line
    # it makes it easier to see all options and change or remove them.
    # snakemake will combine all the lines so note the spaces at the end of
    # each line!
    rule HaplotypeCalling:
        input:
            reference = reference,
            vcf = variants,
            bam = sample_bam,
        output:
            temp_vcf
        envmodules:
            'whatshap/1.1-foss-2020b-Python-3.8.6'
        shell:
            "whatshap polyphase "
                "--ploidy 4 "
                "-o {output} "
                "--sample {wildcards.sample} "
                "--chromosome {wildcards.chroms}"
                "--reference {input.reference} {input.vcf} {input.bam} "
    
    # options before arguments
    rule SplitVCF:
        input:
            temp_vcf
        output:
            split_vcf
        envmodules:
            'BCFtools/1.11-GCC-10.2.0'
        shell:
            "bcftools view "
                "-s {wildcards.sample} "
                "-o {output} "
                "{input}"
            
    # use output type option of bcf tools to skip the bgzip step
    rule ConcatVCF:
        input:
            expand(split_vcf, chrom=chroms, allow_missing=true)
        output:
            phased_output
        envmodules:
            'BCFtools/1.11-GCC-10.2.0'
        shell:
            "bcftools concat "
                "-o {output} "
                "-O z "  # compressed vcf output
                "{input} "
    

    Not tested but that's my first pass at it!

    Spend some time with the rules page on the documentation and work through the tutorial. Wildcards are important but really subtle. Here's some materials for a workshop I give. It's a little dated but the core material is still good.