I'm new to Snakemake and I'm trying to put together my own pipeline after completing the tutorial and I'm running into some issues when trying to create a pipeline to handle samples with multiple input files. I'm trying to create a pipeline that does the following steps:
An example directory structure I am working with is as follows (the total number of files per sample may different depending on the sample):
data
experiment
Sample_1
Sample_1_0.txt
Sample_1_1.txt
Sample_1_2.txt
My config file as it stands now is as follows (trying to run the pipeline for a single sample):
exp_dir: experiment
sample: Sample_1
distance: 450
max_wavelength: 710
min_wavelength: 575
My pipeline that I've tried to put together is as follows in my Snakefile:
configfile: "config.yaml"
EXP_DIR=config["exp_dir"]
SAMPLE=config["sample"]
FINAL_FILE=f"results/{EXP_DIR}/{SAMPLE}/{SAMPLE}_complete.txt"
wcs = glob_wildcards("data" + "/" + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt")
rule all:
input:
FINAL_FILE
rule detect_peaks:
input:
"data" + "/" + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt"
output:
"results/" + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt"
params:
distance=config["distance"],
max_wavelength=config["max_wavelength"],
min_wavelength=config["min_wavelength"]
shell:
"python scripts/peak_detection.py -i {input} -o {output} -d {params.distance} "
"-m {params.min_wavelength} -x {params.max_wavelength}"
rule agg_files:
input:
expand("data/results" + "/" + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt", ts=wcs["rep"])
output:
FINAL_FILE
output:
"python scripts/combine_peaks.py -i {input} -o {output}"
When I try to build the DAG before running the pipeline I get the following error:
TypeError in file XXXX, line 29:
tuple indices must be integers or slices, not str
Any help/suggestions as to how I can fix this issue or improve my pipeline would be greatly appreciated!
The specific error is due to how you access the contents of wcs
. It should be wcs.rep
instead of wcs["rep"]
, the underlying structure is a tuple, not a dict.
The rest of your snakefile will work but here are a few style tips:
paths
key. So your config would be:paths:
raw_data: "data/experiment/Sample_1/Sample_1_{rep}.txt"
output: "results/experiment/Sample_1/Sample_1_complete.txt"
It's more flexible in that a user can decide later they want to change the file tree and it will still work. You can also add wildcards a bit easier, say you had multiple experiments and samples. You can glob_wildcards all of those and expand (with a zip!) and everything will work basically the same.
os.path.join
to make your filenames. "data" + "/" + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt"
# vs
f"data/{EXP_DIR}/{SAMPLE}/{SAMPLE}_{{rep}}.txt"
# or
os.path.join("data", EXP_DIR, SAMPLE, SAMPLE + "_{rep}.txt")
-x {params.max_wavelength}
# vs
-x {config[max_wavelength]}
script
directive to simplify passing in snakemake variables. If you do this, disregard my previous suggestion.