I am trying to create a snakefile that lets me run a workflow on any dataset that I want. I will give you a shortened version of what I am trying to do:
The command that I use to run the snakefile is as follows:
snakemake -j 20 -s /path/to/snakefile --config workdir=/path/to/workdir data_dir=/path/to/data/dir
I've included the part of the snakefile that causes me trouble, plus the necessary context on how some of the wildcards are created (tips on better coding practices are appreciated, if you spot any (^: ):
# Import functions
import os
from pathlib import Path
import subprocess
# Global variables
WORKDIR = Path(config['workdir'])
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts"
# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.{{extension}}'
SAMPLE_NMBR = glob_wildcards(SAMPLES)
# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))
# Define output for every rule
rule all:
    input:
        # get output for tool e
        expand("{workdir}/{sample_nmbr}/e_run/output_e.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # get output for tool c
        expand("{workdir}/{sample_nmbr}/c_run/output_c.txt", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying e samples to the same directory
        expand("{workdir}/e_samples/{sample_nmbr}.e", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # copying c samples to the same directory
        expand("{workdir}/c_samples/{sample_nmbr}.c", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR),
        # Process the results
        expand("{workdir}/last_tool_output.txt", workdir = WORKDIR)
INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'
# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {tool_c_output}
        tool_e {input.input_file} {tool_e_output}
        """
# copy output of the 2 tools to their respective directories in preparation for the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e",
        checkpoint_copy_output: touch("{workdir}/copying_done.txt")
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """
# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output: rules.copy_output.checkpoint_copy_output
    output:
        output_that_I_need = "{workdir}/last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """
To explain a bit more: the first rule is rule all, which defines the output that I want. Then the rule run_tool runs 2 non-descriptive tools that each give an output for every sample. The rule copy_output takes the output of run_tool and copies each output file to a directory holding the other outputs of that tool (so you get 1 directory with all the output.c files and another one with all the output.e files). Finally, the last rule is executed, but it has nothing in common with the previous rules when it comes to its wildcards, except for the work directory.
That is why I included the checkpoint_copy_output line in the copy_output rule: to force the clean_data rule to execute only when copy_output is finished. If I leave it out, the clean_data rule runs before anything else and the snakefile reports an error.
But when I include it, Snakemake throws this error for the shell line in rule copy_output:
not all output log and benchmark files of rule copy_output contain the same wildcards
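For anyone landing here with the same message: Snakemake wants every output (and log and benchmark) file of a single rule to use the same set of wildcards, so that all outputs of a job can be worked out from any one of them. The snippet below is only a rough plain-Python illustration of that check, not Snakemake's own code; `wildcard_names` and `same_wildcards` are hypothetical helper names.

```python
import re


def wildcard_names(pattern):
    """Return the set of wildcard names used in a pattern like '{workdir}/{sample_nmbr}.c'."""
    # Matches '{name}' as well as '{name,regex}'; only the name part is kept.
    return set(re.findall(r"\{\s*(\w+)\s*(?:,[^{}]*)?\}", pattern))


def same_wildcards(patterns):
    """True if every pattern uses exactly the same wildcard names."""
    name_sets = [wildcard_names(p) for p in patterns]
    return all(names == name_sets[0] for names in name_sets)


# The three outputs declared in rule copy_output of the question.
copy_output_outputs = [
    "{workdir}/c_together/{sample_nmbr}.c",
    "{workdir}/e_together/{sample_nmbr}.e",
    "{workdir}/copying_done.txt",  # no {sample_nmbr} here
]

print(same_wildcards(copy_output_outputs))      # False
print(same_wildcards(copy_output_outputs[:2]))  # True
```

The marker file only carries `{workdir}`, while the two copied files also carry `{sample_nmbr}`, which is the mismatch the message complains about.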
Including the checkpoint file as a parameter also doesn't work:
    params:
        checkpoint_extract = "{workdir}/extract_done.txt"
    shell:
        """
        cp {input.c_output } {output.c_copied}
        cp {input.e_output } {output.e_copied}
        """
# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        checkpoint_copy_output: "{workdir}/copying_done.txt"
This gives me the error: Missing input files for rule clean_data: /path/to/workdir/copying_done.txt
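A likely reason for this second error is that a `params` entry does not declare a file, so no rule lists copying_done.txt under `output` and Snakemake finds nothing that could create it. The sketch below imitates that lookup in plain Python under simplifying assumptions: each wildcard matches `.+` (Snakemake's default), `pattern_to_regex` and `producers` are hypothetical helpers, and `s1` is a made-up sample name.

```python
import re


def pattern_to_regex(pattern):
    """Turn an output pattern into a compiled regex; each {wildcard} matches '.+'."""
    # re.split with one capture group alternates: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    seen = set()
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text between wildcards
        elif part in seen:
            regex += f"(?P={part})"        # repeated wildcard must match the same text
        else:
            seen.add(part)
            regex += f"(?P<{part}>.+)"     # first occurrence of a wildcard
    return re.compile(regex)


# Output patterns per rule, as in the variant that uses params for the marker file.
RULE_OUTPUTS = {
    "run_tool": [
        "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    ],
    "copy_output": [
        "{workdir}/c_together/{sample_nmbr}.c",
        "{workdir}/e_together/{sample_nmbr}.e",
    ],
    "clean_data": ["{workdir}/last_tool_output.txt"],
}


def producers(target):
    """Names of the rules that have an output pattern matching the target path."""
    return [
        rule
        for rule, outputs in RULE_OUTPUTS.items()
        if any(pattern_to_regex(out).fullmatch(target) for out in outputs)
    ]


print(producers("/path/to/workdir/c_together/s1.c"))   # ['copy_output']
print(producers("/path/to/workdir/copying_done.txt"))  # []
```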
I am completely stuck on this problem and haven't found anything online on how to solve it. I know that the wildcards need to be the same unless you use some more complex code to get around that, but I haven't managed to get that working. If someone can tell me how to change my code or snakefile setup to make it work, I would greatly appreciate it.
Thanks in advance,
Matthijs
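I had to fix a few lines to make it execute, so I most likely changed something which is important to your original question. The two main things are:
- rule clean_data now takes the copied .c and .e files of all samples as input (built with expand) instead of a marker file
- rule all only asks for the final last_tool_output.txt
See below for a version which seems to work (again, maybe I missed the point)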
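That version hinges on `clean_data` listing the copied files of every sample as its input, built with `expand(...)`. As a rough idea of what such a call produces, here is a plain-Python stand-in. It is not Snakemake's implementation; `expand_sketch` and the sample names `s1` and `s2` are made up for illustration.

```python
from itertools import product


def expand_sketch(pattern, **wildcards):
    """Fill the pattern with every combination of the given wildcard values."""
    names = list(wildcards)
    # Accept either a single value or a list/tuple of values per wildcard.
    value_lists = [
        list(v) if isinstance(v, (list, tuple)) else [v]
        for v in wildcards.values()
    ]
    return [
        pattern.format(**dict(zip(names, combo)))
        for combo in product(*value_lists)
    ]


WORKDIR = "/path/to/workdir"
SAMPLE_NMBR = ("s1", "s2")  # hypothetical sample names

c_copied = expand_sketch(
    "{workdir}/c_together/{sample_nmbr}.c",
    workdir=WORKDIR,
    sample_nmbr=SAMPLE_NMBR,
)
print(c_copied)
# ['/path/to/workdir/c_together/s1.c', '/path/to/workdir/c_together/s2.c']
```

Each of those concrete paths matches an output pattern of copy_output, so Snakemake schedules one copy_output job per sample before clean_data can run, and no marker file is needed.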
# Import functions
import os
from pathlib import Path
import subprocess
# Global variables
WORKDIR = Path(config['workdir'])
DATA_DIR = Path(config['data_dir'])
SCRIPTS = "/path/to/scripts"
# Import read files
SAMPLES = f'{DATA_DIR}/{{sample_nmbr}}.txt'
SAMPLE_NMBR = glob_wildcards(SAMPLES).sample_nmbr
print(SAMPLE_NMBR)
# Create unique entries for SAMPLE_NMBR
SAMPLE_NMBR = tuple(set(SAMPLE_NMBR))
# Define output for every rule
rule all:
    input:
        expand("{workdir}/last_tool_output.txt", workdir = WORKDIR)
INPUT = f'{DATA_DIR}/{{sample_nmbr}}.txt'
# Run the 2 tools on the input data
rule run_tool:
    input:
        input_file = INPUT
    output:
        tool_c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        tool_e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt"
    message:
        "Performing tool e and c on {wildcards.sample_nmbr}"
    shell:
        """
        tool_c {input.input_file} {output.tool_c_output}
        tool_e {input.input_file} {output.tool_e_output}
        """
# copy output of the 2 tools to their respective directories in preparation for the final rule
rule copy_output:
    input:
        c_output = "{workdir}/{sample_nmbr}/c_run/output_c.txt",
        e_output = "{workdir}/{sample_nmbr}/e_run/output_e.txt",
    output:
        c_copied = "{workdir}/c_together/{sample_nmbr}.c",
        e_copied = "{workdir}/e_together/{sample_nmbr}.e"
    message:
        "Copying the output data"
    shell:
        """
        cp {input.c_output} {output.c_copied}
        cp {input.e_output} {output.e_copied}
        """
# Get final file that I need, which is an output of the final custom script
rule clean_data:
    input:
        c_copied = [expand("{workdir}/c_together/{sample_nmbr}.c", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR)],
        e_copied = [expand("{workdir}/e_together/{sample_nmbr}.e", workdir = WORKDIR, sample_nmbr = SAMPLE_NMBR)]
    output:
        output_that_I_need = "{workdir}/last_tool_output.txt"
    params:
        scripts = SCRIPTS,
        workdir = WORKDIR,
    shell:
        """
        # Clean up data
        {params.scripts}/custom_script_c.py {params.workdir}/c_together {params.scripts}
        {params.scripts}/custom_script_e.py {params.workdir}/e_together {params.scripts}
        {params.scripts}/custom_script_final.py {output.output_that_I_need}
        """