
Run snakemake rules iteratively


I thought I was finally grasping Snakemake, but when I tried to run it on several different data files, I realized it doesn't work as I thought. This is the Snakefile:

import pandas as pd

configfile: "config.json"
experiments = pd.read_csv(config["experiments"], sep = '\t')
experiments['Name'] = [filename.split('/')[-1].split('.fa')[0] for filename in experiments['Files']]

rule all:
    input:
        expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq", output = config["output"],
            fr = (['_forward_paired', '_reverse_paired'] if experiments["Files"].str.contains(',').tolist() else ''),
               name = experiments['Name'])

rule preprocess:
    input:
        experiments["Files"].str.split(',')
    output:
        expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq", output = config["output"],
            fr = (['_forward_paired', '_reverse_paired'] if experiments["Files"].str.contains(',').tolist() else ''),
               name = experiments['Name'])
    threads:
        config["threads"]
    run:
        shell("python preprocess.py -i {reads} -t {threads} -o {output} -adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases -d {data_type}",
            output = config["output"], data_type = experiments["Data type"].tolist(), reads = ",".join(input))

This is the config file:

{
  "output": "test_snakemake",
  "threads": 14,
  "experiments": "experiments.tsv"
}

And this is the experiments file:

Files   Sample  Data type   Condition
path/to/mg_R1.fastq,path/to/mg_R2.fastq Sample  dna
path/to/a/0.01/mt_0.01a_R1.fastq,path/to/a/0.01/mt_0.01a_R2.fastq   Sample  rna c1
path/to/b/0.01/mt_0.01b_R1.fastq,path/to/b/0.01/mt_0.01b_R2.fastq   Sample  rna c1
path/to/c/0.01/mt_0.01c_R1.fastq,path/to/c/0.01/mt_0.01c_R2.fastq   Sample  rna c1
path/to/a/1/mt_1a_R1.fastq,path/to/a/1/mt_1a_R2.fastq   Sample  rna c2
path/to/b/1/mt_1b_R1.fastq,path/to/b/1/mt_1b_R2.fastq   Sample  rna c2
path/to/c/1/mt_1c_R1.fastq,path/to/c/1/mt_1c_R2.fastq   Sample  rna c2
path/to/a/100/mt_100a_R1.fastq,path/to/a/100/mt_100a_R2.fastq   Sample  rna c3
path/to/b/100/mt_100b_R1.fastq,path/to/b/100/mt_100b_R2.fastq   Sample  rna c3
path/to/c/100/mt_100c_R1.fastq,path/to/c/100/mt_100c_R2.fastq   Sample  rna c3

What I want is for the preprocess rule to treat each row separately. I assumed that was how shell would expand the command, so that it would run one command per row, for example:

python preprocess.py -i path/to/mg_R1.fastq,path/to/mg_R2.fastq -t 14 -o test_snakemake -adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases -d dna

Instead, it joins ALL rows and runs a single command on all samples simultaneously:

python preprocess.py -i path/to/mg_R1.fastq,path/to/mg_R2.fastq,path/to/a/0.01/mt_0.01a_R1.fastq,path/to/a/0.01/mt_0.01a_R2.fastq,path/to/b/0.01/mt_0.01b_R1.fastq,path/to/b/0.01/mt_0.01b_R2.fastq,... -t 14 -o test_snakemake -adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases -d dna rna rna rna rna rna rna rna rna rna

How can I make snakemake consider each row separately?


Solution

  • This is a very common mistake. The thing to remember is that a rule should describe the work for a single sample. Snakemake takes the paths you request, matches them against the rules' output patterns (with wildcards), and generates one specific job per match. Your rule takes all inputs and declares all outputs at once, whereas, I presume, preprocess.py expects a single input/output.

    Instead, consider one file at a time. For the output, "{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq", how do you generate that file? You would have to match to an input file in your experiments dataframe using the name as a key.

    def preprocess_input(wildcards):
        # get files with matching names
        df = experiments.loc[experiments['Name'] == wildcards.name, 'Files']
        # get first value (in case multiple) and split on commas
        return df.iloc[0].split(',')
    
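    rule preprocess:
        input:
            preprocess_input
        output:
            "{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}.fq"
        params:
            # comma-join the inputs; {input} on its own expands space-separated
            reads = lambda wildcards, input: ",".join(input)
        threads:
            config["threads"]
        shell:
            'python preprocess.py -i {params.reads} -t {threads} -o {config[output]} ...'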
    

    That uses an input function to find the correct input files from the output file. It's not perfect but should get you in the right direction.
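
To see the "one output file, one job" idea outside of Snakemake, here is a rough standalone sketch that resolves a single requested output file to its wildcards and then to its input files. It is only an illustration under a few assumptions: the table is parsed with the standard library instead of pandas, `OUTPUT_PATTERN` and `job_for` are hypothetical names that merely imitate the wildcard matching with a regular expression, and the `fr` group is restricted to the two suffixes from the question so that the adjacent `{name}{fr}` wildcards split unambiguously.

```python
import csv
import io
import re
from types import SimpleNamespace

# Two rows copied from the experiments file in the question (tab-separated).
EXPERIMENTS_TSV = (
    "Files\tSample\tData type\tCondition\n"
    "path/to/mg_R1.fastq,path/to/mg_R2.fastq\tSample\tdna\t\n"
    "path/to/a/1/mt_1a_R1.fastq,path/to/a/1/mt_1a_R2.fastq\tSample\trna\tc2\n"
)


def load_experiments(text):
    """Parse the table and add a 'Name' key using the Snakefile's expression."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["Name"] = row["Files"].split("/")[-1].split(".fa")[0]
    return rows


experiments = load_experiments(EXPERIMENTS_TSV)


def preprocess_input(wildcards):
    """Return the input files of the first row whose Name matches wildcards.name."""
    matches = [row["Files"] for row in experiments if row["Name"] == wildcards.name]
    return matches[0].split(",")


# Hypothetical regex stand-in for the rule's output pattern (an assumption,
# not Snakemake's own code). {fr} is limited to the two known suffixes.
OUTPUT_PATTERN = re.compile(
    r"(?P<output>[^/]+)/Preprocess/Trimmomatic/quality_trimmed_"
    r"(?P<name>.+)(?P<fr>_forward_paired|_reverse_paired)\.fq"
)


def job_for(target):
    """Map one requested output file to the wildcards and inputs of one job."""
    match = OUTPUT_PATTERN.fullmatch(target)
    if match is None:
        raise ValueError(f"{target!r} does not match the output pattern")
    wildcards = SimpleNamespace(**match.groupdict())
    return wildcards, preprocess_input(wildcards)


target = "test_snakemake/Preprocess/Trimmomatic/quality_trimmed_mt_1a_R2_forward_paired.fq"
wildcards, inputs = job_for(target)
print(wildcards.name, wildcards.fr)
print(inputs)
```

Two things stand out when you run this. With the question's name expression, the key for a paired row comes from the second file (for example `mt_1a_R2`). The forward and reverse targets of a sample also resolve to the same inputs, so the rule as written would be scheduled once per output file. In the real Snakefile, the restriction on `fr` would probably be expressed with `wildcard_constraints`, but check that against your Snakemake version.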