Search code examples
pythonconfigsnakemake

Converting a large pipeline to snakemake


I have developed MOSCA, a pipeline for meta-omics analysis (MG with MT), which is available through Bioconda. I want to convert it to snakemake, since it would easily allow MOSCA to run simultaneously some accesses through APIs and some computationally demanding tasks. Also, I think it would help better shape the tool to a standard format.

My question is, MOSCA has a lot of parameters, which will have to be transfered to a configuration file. While this is trivial for most parameters, inputting MG and MT files together is trickier. Also, MOSCA considers samples together. So I created a samples' file, samples.tsv

MG files    MT files    Sample
path/to/mg_R1.fastq,path/to/mg_R2.fastq path/to/mt_R1.fastq,path/to/mt_R2.fastq Sample

and I want the Snakefile to read it and retain the "Sample" information. Follows the example for including the MG preprocessing and the assembly of the pipeline, where the preprocess.py and assembly.py scripts contain the corresponding functionalities. This is the Snakefile

from pandas import read_table

configfile: "config.json"

rule all:
  input:
    expand("{output}/Assembly/{experiment[1][Sample]}/contigs.fasta", 
            output = config["output"], experiment = read_table(config["experiments"], "\t").iterrows())

rule mg_preprocess:             # mt_preprocess make same, but data = mrna?
  input:
    list(expand({experiment}[1]["MG files"], experiment = read_table(config["experiments"], "\t").iterrows()))[0]
  output:
    expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}_paired.fq", 
            fr = ['forward', 'reverse'], output = config["output"], name = 'pretty_commune')
  threads: 
    config["threads"]
  run:
    shell("""python preprocess.py -i {reads} -t {threads} -o {output} 
            -adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases""", 
            output = config["output"])

rule assembly:
  input:
    expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}_paired.fq", 
            fr = ['forward', 'reverse'], output = config["output"], name = 'pretty_commune')
  output:
    expand("{output}/Assembly/{experiment[1][Sample]}/contigs.fasta", 
            output = config["output"], experiment = read_table(config["experiments"], "\t").iterrows())
  threads:
    config["threads"]
  run:
    reads = ",".join(map(str, input))
    shell("python assembly.py -r {input} -t {threads} -o {output}/Assembly/{sample} -a {assembler}", 
    output = config["output"], sample = config["experiments"][1]["sample"], 
    assembler = config["assembler"])

and this is the config.json

{
"output": 
    "snakemake_learn",
"threads":
    14,
"experiments":
    "samples.tsv",
"assembler":
    "metaspades"
    }

and I get the error

NameError in line 12 of /path/to/Snakefile:
name 'experiment' is not defined
  File "/path/to/Snakefile", line 12, in <module>

How is experiment not defined?


Solution

  • I'm not sure what you are trying to do but to me the line

    list(expand({experiment}[1]["MG files"], experiment = read_table(config["experiments"], "\t").iterrows()))[0]
    

    seems too complicated. I would put the sample sheet in a DataFrame and work with that rather using the iterator from read_table every time. E.g.:

    import pandas
    
    ss = pandas.read_csv(config["experiments"], sep= '\t')
    
    ...
    
    rule mg_preprocess:             # mt_preprocess make same, but data = mrna?
        input:
            expand('{experiment}', experiment= ss["MG files"][0]),
    

    Anyway, I think the error name 'experiment' is not defined is because {experiment} in list(expand({experiment}[1]["MG files"] is not in a string that can be replace with the actual value.


    EDIT:

    In fact the line expand('{experiment}', experiment= ss["MG files"][0]), doesn't make a lot of sense... It's just the same as ss["MG files"][0]