Tags: automation, output, wildcard, snakemake

Snakemake issue: wildcard problems when trying to force a rule to be run after another rule


I am trying to create a snakefile that lets me run a workflow on any dataset that I want. I will give you a shortened version of what I am trying to do:

The command that I use to run the snakefile is as follows:

snakemake -j 20 -s /path/to/snakefile --config workdir=/path/to/workdir data_dir=/path/to/data/dir

I've included the part of the snakefile that causes me trouble and the necessary context on how some of the wildcards are created (tips on better coding practices are appreciated, if you spot any (^: ):

# Import functions
import os
from pathlib import Path

import subprocess
# Global variables
WORKDIR = Path(config['workdir']) 
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts"

# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.{{extension}}'

SAMPLE_NMBR = glob_wildcards(SAMPLES)

# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))



# Define output for every rule
rule all:
    input:
        # get output for tool e
        expand("{workdir}/{sample_nmbr}/e_run/output_e.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # get output for tool c
        expand("{workdir}/{sample_nmbr}/c_run/output_c.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying e samples to the same directory
        expand("{workdir}/e_together/{sample_nmbr}.e", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying c samples to the same directory
        expand("{workdir}/c_together/{sample_nmbr}.c", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # Process the results
        expand("{workdir}/last_tool_output.txt", workdir = WORKDIR)

INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'

# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT 
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {tool_c_output}
        tool_e {input.input_file} {tool_e_output}
        """

# copy output of the 2 tools to the same respective directory as preparation for the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e",
        checkpoint_copy_output = touch("{workdir}/copying_done.txt")
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """

# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output = rules.copy_output.output.checkpoint_copy_output
    output:
        output_that_I_need = "{workdir}/last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """

So for an extra explanation: the first rule is rule all, which naturally defines the output that I want. Then, the rule run_tool runs 2 non-descriptive tools that give an output for each sample. The rule copy_output takes the output of the run_tool rule and copies each output file into a directory shared with the other outputs of that specific tool (so you get one directory with all the output.c files and another one with the output.e files). Finally, the last rule is executed, but it has nothing in common with the previous rules when it comes to its wildcards, except for the work directory.

That is why I included the checkpoint_copy_output line in the copy_output rule, to force the clean_data rule to execute only when copy_output is finished. If I exclude it, the clean_data rule runs before anything else and the snakefile reports an error.

But when I include it, Snakemake throws an error for the copy_output rule: Not all output, log and benchmark files of rule copy_output contain the same wildcards.
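As far as I understand, that error means every output file of a rule has to carry the same set of wildcards, so I suspect the flag would have to be declared per sample, roughly like this (just a sketch, the flags/ directory is made up):

rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e",
        # one flag per sample keeps the wildcards of all outputs identical
        copy_flag = touch("{workdir}/flags/{sample_nmbr}.copied")
    shell:
        """
        cp {input.c_output} {output.c_copied}
        cp {input.e_output} {output.e_copied}
        """

But even then I don't see how clean_data could depend on those flags, since it has no {sample_nmbr} wildcard of its own.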

Including the checkpoint file as a parameter also doesn't work:

    params:
        checkpoint_extract = "{workdir}/extract_done.txt"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """

# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output = "{workdir}/copying_done.txt"

From which I get the error: Missing input files for rule clean_data: /path/to/workdir/copying_done.txt

I am completely stuck on this problem and haven't found anything online about how to solve it. I know that the wildcards need to be the same unless you use some more complex code to bypass this, but I haven't been able to reproduce that. If someone can tell me how to change my code or snakefile setup to make it work, I would greatly appreciate it.

Thanks in advance,

Matthijs


Solution

  • I had to fix a few lines to make it execute, so I may well have changed something that is important to your original question. The main points are:

    • Why are you asking for the extension with glob_wildcards if you only work with .txt? (I guess this is only true for your example?)
    • Not sure why you need the checkpoint. In your example you have output files which are needed by the clean_data scripts, so why not use the output of copy_output as the clean_data input?
    • clean_data could not work the way you wrote it, because its input has two wildcards and its output only one. So either you have one output file per {sample_nmbr}, or, if you want all the files as its input, you need to create a list as its input (the way I did it below) to tell Snakemake that the rule only has to run once, with all previous files as input and one output.

    See below for a version which seems to work (again, maybe I missed the point):

    # Import functions
    import os
    from pathlib import Path
    
    import subprocess
    # Global variables
    WORKDIR = Path(config['workdir']) 
    DATA_DIR = Path(config['data_dir'])
    SCRIPTS = "/path/to/scripts"
    
    # Import read files
    SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.txt'
    
    SAMPLE_NMBR = glob_wildcards(SAMPLES).sample_nmbr
    print(SAMPLE_NMBR)
    # Create unique entries for SAMPLE_NMBR
    SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))
    
    
    # Define output for every rule
    rule all:
        input:
            expand("{workdir}/last_tool_output.txt", workdir = WORKDIR)
    
    INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'
    
    # Run the 2 tools on the input data
    rule run_tool:
        input:
            input_file = INPUT 
        output:
            tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
            tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
        message:
            "Performing tool e and c on {wildcards.sample_nmbr}"
        shell:
            """
            tool_c {input.input_file} {output.tool_c_output}
            tool_e {input.input_file} {output.tool_e_output}
            """
    
    # copy output of the 2 tools to the same respective directory as preparation for the final rule
    rule copy_output:
        input:
            c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
            e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
        output:
            c_copied = "{workdir}/c_together/{sample_nmbr}.c",
            e_copied = "{workdir}/e_together/{sample_nmbr}.e"
        message:
            "Copying the output data"
        shell:
            """
            cp {input.c_output} {output.c_copied}
            cp {input.e_output} {output.e_copied}
            """
    
    
    # Get final file that I need, which is an output of the final custom script
    rule clean_data:
        input:
            c_copied = expand("{workdir}/c_together/{sample_nmbr}.c", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
            e_copied = expand("{workdir}/e_together/{sample_nmbr}.e", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR)
        output:
            output_that_I_need = "{workdir}/last_tool_output.txt"
        params:
            scripts = SCRIPTS,
            workdir = WORKDIR,
        shell:
            """
            # Clean up data
            {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
            {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
            {params.scripts}/custom_script_final.py {output.output_that_I_need}
            """