I have developed MOSCA, a pipeline for meta-omics analysis (MG with MT), which is available through Bioconda. I want to convert it to snakemake, since it would easily allow MOSCA to run simultaneously some accesses through APIs and some computationally demanding tasks. Also, I think it would help better shape the tool to a standard format.
My question is, MOSCA has a lot of parameters, which will have to be transfered to a configuration file. While this is trivial for most parameters, inputting MG and MT files together is trickier. Also, MOSCA considers samples together. So I created a samples' file, samples.tsv
MG files MT files Sample
path/to/mg_R1.fastq,path/to/mg_R2.fastq path/to/mt_R1.fastq,path/to/mt_R2.fastq Sample
and I want the Snakefile to read it and retain the "Sample" information. Follows the example for including the MG preprocessing and the assembly of the pipeline, where the preprocess.py
and assembly.py
scripts contain the corresponding functionalities.
This is the Snakefile
from pandas import read_table
configfile: "config.json"
rule all:
input:
expand("{output}/Assembly/{experiment[1][Sample]}/contigs.fasta",
output = config["output"], experiment = read_table(config["experiments"], "\t").iterrows())
rule mg_preprocess: # mt_preprocess make same, but data = mrna?
input:
list(expand({experiment}[1]["MG files"], experiment = read_table(config["experiments"], "\t").iterrows()))[0]
output:
expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}_paired.fq",
fr = ['forward', 'reverse'], output = config["output"], name = 'pretty_commune')
threads:
config["threads"]
run:
shell("""python preprocess.py -i {reads} -t {threads} -o {output}
-adaptdir MOSCA/Databases/illumina_adapters -rrnadbs MOSCA/Databases/rRNA_databases""",
output = config["output"])
rule assembly:
input:
expand("{output}/Preprocess/Trimmomatic/quality_trimmed_{name}{fr}_paired.fq",
fr = ['forward', 'reverse'], output = config["output"], name = 'pretty_commune')
output:
expand("{output}/Assembly/{experiment[1][Sample]}/contigs.fasta",
output = config["output"], experiment = read_table(config["experiments"], "\t").iterrows())
threads:
config["threads"]
run:
reads = ",".join(map(str, input))
shell("python assembly.py -r {input} -t {threads} -o {output}/Assembly/{sample} -a {assembler}",
output = config["output"], sample = config["experiments"][1]["sample"],
assembler = config["assembler"])
and this is the config.json
{
"output":
"snakemake_learn",
"threads":
14,
"experiments":
"samples.tsv",
"assembler":
"metaspades"
}
and I get the error
NameError in line 12 of /path/to/Snakefile:
name 'experiment' is not defined
File "/path/to/Snakefile", line 12, in <module>
How is experiment not defined?
I'm not sure what you are trying to do but to me the line
list(expand({experiment}[1]["MG files"], experiment = read_table(config["experiments"], "\t").iterrows()))[0]
seems too complicated. I would put the sample sheet in a DataFrame and work with that rather using the iterator from read_table
every time. E.g.:
import pandas
ss = pandas.read_csv(config["experiments"], sep= '\t')
...
rule mg_preprocess: # mt_preprocess make same, but data = mrna?
input:
expand('{experiment}', experiment= ss["MG files"][0]),
Anyway, I think the error name 'experiment' is not defined
is because {experiment}
in list(expand({experiment}[1]["MG files"]
is not in a string that can be replace with the actual value.
EDIT:
In fact the line expand('{experiment}', experiment= ss["MG files"][0]),
doesn't make a lot of sense... It's just the same as ss["MG files"][0]