Search code examples
pythonautomationsnakemake

Snakemake workflow in which wildcards produce different output files


I am building a snakemake workflow in which certain wildcards(populations) have extra steps not shared by all wildcards. I have 8 populations that run a pedigree-based evaluation, and 6 of these 8 populations run, in addition to the pedigree evaluation, a genomic evaluation. My workflow includes a Python script that only generates a genotype file in the case of a population in the genomic workflow. A summary of the issue is given below. The population CHA runs with the genomic workflow, and the population BEL works with the pedigree-based workflow. In the case of the BEL wildcard, the Python script produces the [dlistAnim, phen_file] files, and in the case of the CHA wildcard, the Python script produces [dlistAnim, phen_file, gen_file].

genomic_breeds = {"CHA": "CHAROLAIS"}
breeds = {"CHA": "CHAROLAIS",  "BEL":"BELGIAN BLUE"}


rule extract_phenotype_data:
    input:
        
    params:
        config = "../config_file.yml",
        breed =f"{{breed}}"
    output:
        dlistAnim=f"../listcodeall{{breed}}.txt",
        phen_file=f"../phen_{{breed}}.txt",
        gen_file=f"../genotypes_{{breed}}.txt"
    run:
        cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
        shell(cmd)

The file gen_file is required by the steps after the pedigree-based evaluation that should run for only the genomic breeds (CHA) wildcards.

I have tried the dynamic file command, however, I run into a bug that refers me to https://github.com/snakemake/snakemake/issues/823.

I would expect a workflow that runs for all wildcards to a certain level and then continues for a subset of the wildcards to the end. Additionally, the workflow should account for the files that may not be present in the pedigree-based workflow.

The snakemake version is 7.25.0


Solution

  • I found a solution to this question by using two rules in combination with the wildcard constraints directive.

    The first rule produces two outputs and only works on wildcards not in the genomic_breeds dictionary.

    rule extract_phenotype_data:
            wildcard_constraints:
            breed='|'.join([breed for breed in breeds.keys() if not breed 
            in genomic_breeds])
        input:
            
        params:
            config = "../config_file.yml",
            breed =f"{{breed}}"
        output:
            dlistAnim=f"../listcodeall{{breed}}.txt",
            phen_file=f"../phen_{{breed}}.txt",
        run:
            cmd = f"python /../extract_phenotype_data_for_populations.py 
            --config {params.config} --breed {breeds[params.breed]}"
            shell(cmd)
    

    The second rule uses the wildcard constraints directive as well and only works on wildcards in the genomic_breeds dictionary.

    rule extract_phenotype_data_genomic:
        wildcard_constraints:
            breed='|'.join(list(genomic_breeds.keys()))
        input:
            
        params:
            config = "../config_file.yml",
            breed =f"{{breed}}"
        output:
            dlistAnim=f"../listcodeall{{breed}}.txt",
            phen_file=f"../phen_{{breed}}.txt",
            gen_file=f"../genotypes_{{breed}}.txt"
        run:
            cmd = f"python /../extract_phenotype_data_for_populations.py --config {params.config} --breed {breeds[params.breed]}"
            shell(cmd)
    

    The output of both rules are different in that only one of the rules produces the ../genotypes_{{breed}}.txt file.

    Rule All can be defined as

    rule all:
        input:
            expand(f"../phen_{{breed}}.txt", breed=breeds.keys()),
            expand(f"../listcodeall{{breed}}.txt", breed=breeds.keys()),
            expand(f"../genotypes_{{breed}}.txt", breed =genomic_breeds.keys()),
    

    Perhaps this helps someone someday.