python bioinformatics snakemake

Problems with dynamic input and dynamic output rule


I have a quick question regarding the use of dynamic wildcards. I have searched the documentation and forums, but have not found a straightforward answer to my query.

Here are the rules that are giving me trouble:

rule all:
    input: dynamic("carvemeOut/{species}.xml")
    shell: "snakemake --dag | dot -Tpng > pipemap.png"

rule speciesProt:
    input: "evaluation-output/clustering_gt1000_scg.tab"
    output: dynamic("carvemeOut/{species}.txt")
    shell:
        """
        cd {config[paths][concoct_run]}
        mkdir -p {config[speciesProt_params][dir]}
        cp {input} {config[paths][concoct_run]}/{config[speciesProt_params][dir]}
        cd {config[speciesProt_params][dir]}
        sed -i '1d' {config[speciesProt_params][infile]} #removes first row
        awk '{{print $2}}' {config[speciesProt_params][infile]} > allspecies.txt #extracts node information
        sed '/^>/ s/ .*//' {config[speciesProt_params][metaFASTA]} > {config[speciesProt_params][metaFASTAcleanID]} #removes annotation to protein ID
        Rscript {config[speciesProt_params][scriptdir]}multiFASTA2speciesFASTA.R
        sed -i 's/"//g' species*
        sed -i '/k99/s/^/>/' species*
        sed -i 's/{config[speciesProt_params][tab]}/{config[speciesProt_params][newline]}/' species*
        cd {config[paths][concoct_run]}
        mkdir -p {config[carveme_params][dir]}
        cp {config[paths][concoct_run]}/{config[speciesProt_params][dir]}/species* {config[carveme_params][dir]}
        cd {config[carveme_params][dir]}
        find . -name "species*" -size -{config[carveme_params][cutoff]} -delete #delete files with little information, these cause trouble
        """

rule carveme:
    input: dynamic("carvemeOut/{species}.txt")
    output: dynamic("carvemeOut/{species}.xml")
    shell:
        """
        set +u;source activate concoct_env;set -u
        cd {config[carveme_params][dir]}
        echo {input}
        echo {output}
        carve $(basename {input})
        """

I was previously using two different wildcards for the input and output of the carveme rule:

input: dynamic("carvemeOut/{species}.txt")
output: dynamic("carvemeOut/{gem}.xml")

What I want Snakemake to do is run the carveme rule once per input file, creating one output .xml file for each input .txt file. Instead, Snakemake runs the rule a single time, taking the whole list of inputs to create one output, as can be seen below:

rule carveme:
input: carvemeOut/species2.txt, carvemeOut/species5.txt, carvemeOut/species1.txt, carvemeOut/species10.txt, carvemeOut/species4.txt, carvemeOut/species17.txt, carvemeOut/species13.txt, carvemeOut/species8.txt, carvemeOut/species14.txt
output: {*}.xml (dynamic)
jobid: 28

After modifying my rules to use the same wildcard, as suggested by @stovfl and shown in the first code box, I get the following error message:

$ snakemake all
Building DAG of jobs...
WildcardError in line 174 of /c3se/NOBACKUP/groups/c3-c3se605-17-8/projects_francisco/binning/snakemake-concot/Snakefile:
Wildcards in input files cannot be determined from output files:
species

Any suggestions on how to address this problem?

Thanks in advance, FZ


Solution

  • Use dynamic() in rule all and in the rule that creates the dynamic output (speciesProt), but not in the last rule (carveme), which should take plain wildcards for both input and output.

    Here is a working example. Take an input file of species names, species_example.txt:

    SpeciesA
    SpeciesB
    SpeciesC
    SpeciesD
    

    The following Snakefile dynamically produces four output files:

    #Snakefile
    rule all:
        input:
            dynamic("carvemeOut/{species}.xml"),
    
    rule speciesProt:
        input: "species_example.txt"
        output: dynamic("carvemeOut/{species}.txt")
        shell:
            """
            awk '{{gsub(/\\r/,"",$1);print  > "carvemeOut/"$1".txt";}}' {input}
            """
    
    
    rule carveme:
        input: "carvemeOut/{species}.txt"
        output: "carvemeOut/{species}.xml"
        shell: "cat {input} > {output}"
    

    dynamic() currently has a lot of restrictions in Snakemake (only one dynamic wildcard is allowed, see Francisco's comment below, and dynamic and non-dynamic outputs cannot be mixed in the same rule), so I avoid it where possible. For example, instead of making this example dynamic, I would have used a Python function to produce a list of possible species names before any rule is run, and used that list to expand the wildcards in rule all. Are you sure you need dynamic output?

    Also, you should avoid writing such long shell sections directly in the Snakefile; either move them into external scripts or break the shell command into multiple rules.
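
The non-dynamic alternative mentioned above (a Python function that lists the species names before any rule runs) could be sketched roughly as follows. This is only an illustration under assumptions: the helper names read_species and carveme_targets are hypothetical, and it presumes the species list (here species_example.txt) already exists when the workflow starts, which may not hold if the names are only known after speciesProt has run.

```python
from pathlib import Path


def read_species(path):
    """Return the first whitespace-separated field of every non-empty line.

    Hypothetical helper: mirrors the awk command in the example Snakefile,
    which uses $1 as the species name. Splitting on whitespace also drops
    any stray carriage returns.
    """
    names = []
    for line in Path(path).read_text().splitlines():
        fields = line.split()
        if fields:
            names.append(fields[0])
    return names


def carveme_targets(species, outdir="carvemeOut"):
    """Build the list of .xml targets that rule all should request."""
    return [f"{outdir}/{name}.xml" for name in species]


# Assumed usage inside the Snakefile, with no dynamic() anywhere:
#
#   SPECIES = read_species("species_example.txt")
#
#   rule all:
#       input: carveme_targets(SPECIES)
#
#   rule carveme:
#       input: "carvemeOut/{species}.txt"
#       output: "carvemeOut/{species}.xml"
#       shell: "cat {input} > {output}"
```

With this layout the rule that writes the .txt files would have to declare them explicitly as well, for example with expand("carvemeOut/{species}.txt", species=SPECIES), since Snakemake then needs to know every file name when it builds the DAG.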