Search code examples
snakemake

Snakemake: Pipeline to Process Samples with Multiple Input Files


I'm new to Snakemake and I'm trying to put together my own pipeline after completing the tutorial and I'm running into some issues when trying to create a pipeline to handle samples with multiple input files. I'm trying to create a pipeline that does the following steps:

  1. Read all the input files present in a sample directory (ex. data/experiment/sample/)and run a python script on each of them producing an output file into a results directory (ex. results/experiment/sample/)
  2. Combine all the results files present in the results directory into a single file (ex. results/experiment/sample/merged.txt)

An example directory structure I am working with is as follows (the total number of files per sample may different depending on the sample):

data
    experiment
        Sample_1
            Sample_1_0.txt
            Sample_1_1.txt
            Sample_1_2.txt

My config file as it stands now is as follows (trying to run the pipeline for a single sample):

exp_dir: experiment
sample: Sample_1
distance: 450
max_wavelength: 710
min_wavelength: 575

My pipeline that I've tried to put together is as follows in my Snakefile:

configfile: "config.yaml"

EXP_DIR=config["exp_dir"]
SAMPLE=config["sample"]
FINAL_FILE=f"results/{EXP_DIR}/{SAMPLE}/{SAMPLE}_complete.txt"

wcs = glob_wildcards("data" + "/"  + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt")

rule all:
    input:
        FINAL_FILE

rule detect_peaks:
    input:
        "data" + "/"  + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt"
    output:
        "results/" + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt"
    params:
        distance=config["distance"],
        max_wavelength=config["max_wavelength"],
        min_wavelength=config["min_wavelength"]
    shell:
        "python scripts/peak_detection.py -i {input} -o {output} -d {params.distance} "
        "-m {params.min_wavelength} -x {params.max_wavelength}"


rule agg_files:
    input:
        expand("data/results" + "/"  + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt", ts=wcs["rep"])
    output:
        FINAL_FILE
    output:
        "python scripts/combine_peaks.py -i {input} -o {output}"

When I try to build the DAG before running the pipeline I get the following error:
TypeError in file XXXX, line 29:
tuple indices must be integers or slices, not str

Any help/suggestions as to how I can fix this issue or improve my pipeline would be greatly appreciated!


Solution

  • The specific error is due to how you access the contents of wcs. It should be wcs.rep instead of wcs["rep"], the underlying structure is a tuple, not a dict.

    The rest of your snakefile will work but here are a few style tips:

    • Replace your exp_dir sample setup with just a path to the experiments. I like to keep these in the config under a paths key. So your config would be:
    paths:
      raw_data: "data/experiment/Sample_1/Sample_1_{rep}.txt"
      output: "results/experiment/Sample_1/Sample_1_complete.txt"
    

    It's more flexible in that a user can decide later they want to change the file tree and it will still work. You can also add wildcards a bit easier, say you had multiple experiments and samples. You can glob_wildcards all of those and expand (with a zip!) and everything will work basically the same.

    • Ignoring that, consider using f-strings or os.path.join to make your filenames.
     "data" + "/"  + EXP_DIR + "/" + SAMPLE + "/" + SAMPLE + "_{rep}.txt"
     # vs
     f"data/{EXP_DIR}/{SAMPLE}/{SAMPLE}_{{rep}}.txt"
     # or
     os.path.join("data", EXP_DIR, SAMPLE, SAMPLE + "_{rep}.txt")
    
    • You don't need to use params to pass in a config value, you can just use it directly:
    -x {params.max_wavelength}
    # vs
    -x {config[max_wavelength]}
    
    • Instead of using the shell directive, consider using the script directive to simplify passing in snakemake variables. If you do this, disregard my previous suggestion.