I'm exploring Snakemake to define my data analysis as a DAG and evaluate it reproducibly. I installed it in a separate environment on WSL2 on Windows 10. Before working on my actual project I wanted to try it on a simple project to get a feel for how it works.
I wanted to test using the shell and python on 3 txt files (2 columns of numbers, comma separated). A first rule copies the files as .csv to a separate folder 'intermediate', through a shell command. A second rule then loads a python script to read in the intermediate file and graph it using matplotlib. I want to run the python script in a separate conda environment. This is not strictly necessary for what I want to do, but I can see the benefits of it.
The DAG runs properly and outputs the files. However, the rule involving conda/python takes 90 seconds. This seems unnecessarily long; from the command line I would expect it to run in a second or so. Am I doing something wrong? Is something happening in the background that I'm not aware of? Even if it is necessary, I guess it would be easier to accept if I knew what was happening.
Activating conda environment: ../.snakemake/conda/a060898bb3a415a46236eba6c4b6b5fa_
So I figured it was the activation that took so long. My snakefile looks as follows.
workflow/snakefile
samples = "first_data,second_data,third_data"

rule all:
    input:
        expand("graphs/{file}.png", file=samples.split(",")),

rule make_intermediate:
    input:
        "data/{file}.txt",
    output:
        "intermediate/{file}_shell.csv",
    shell:
        "cp {input[0]} {output[0]}"

rule make_graph:
    input:
        "intermediate/{file}_shell.csv",
    output:
        "graphs/{file}.png",
    # conda:
    #     "../envs/data_env.yaml"
    script:
        "../scripts/a_script.py"
with a_script.py
import matplotlib.pyplot as plt
import numpy as np

def make_graph(filename_in, filename_out):
    data = np.loadtxt(filename_in, delimiter=',')
    plt.figure()
    plt.title(filename_in)
    plt.plot(data[:, 0], data[:, 1])
    plt.savefig(filename_out)
    plt.close()

make_graph(snakemake.input[0], snakemake.output[0])
This is the relevant log from running snakemake -c 4 (I do not use conda to open a separate environment here).
Select jobs to execute...
Execute 3 jobs...
[Wed Jan 17 19:58:26 2024]
localrule make_graph:
input: intermediate/second_data_shell.csv
output: graphs/second_data.png
jobid: 3
reason: Missing output files: graphs/second_data.png
wildcards: file=second_data
resources: tmpdir=/tmp
[Wed Jan 17 19:58:26 2024]
localrule make_graph:
input: intermediate/first_data_shell.csv
output: graphs/first_data.png
jobid: 1
reason: Missing output files: graphs/first_data.png
wildcards: file=first_data
resources: tmpdir=/tmp
[Wed Jan 17 19:58:26 2024]
localrule make_graph:
input: intermediate/third_data_shell.csv
output: graphs/third_data.png
jobid: 5
reason: Missing output files: graphs/third_data.png
wildcards: file=third_data
resources: tmpdir=/tmp
[Wed Jan 17 20:00:39 2024]
Finished job 1.
1 of 4 steps (25%) done
[Wed Jan 17 20:00:39 2024]
Finished job 3.
2 of 4 steps (50%) done
[Wed Jan 17 20:00:39 2024]
Finished job 5.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...
Thank you!
It turns out the problem was in an unexpected place: the Python script itself ran fast enough when I only asked it to print. The problem was that there is no graphical backend in WSL. Adding the following at the top of a_script.py, before the pyplot import, solved the problem:
import matplotlib
matplotlib.use('Agg')
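For reference, here is a sketch of how a_script.py might look with that change applied. It assumes the same two-column, comma-separated input as in the question. The `"snakemake" in globals()` guard is my own addition, not part of the original script: it only lets the file be imported or run outside Snakemake without failing on the missing `snakemake` object.

```python
import matplotlib

# Select the non-interactive Agg backend before pyplot is imported,
# so matplotlib never tries to reach a display.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np


def make_graph(filename_in, filename_out):
    """Plot column 1 against column 0 of a comma-separated file and save it."""
    data = np.loadtxt(filename_in, delimiter=",")
    plt.figure()
    plt.title(filename_in)
    plt.plot(data[:, 0], data[:, 1])
    plt.savefig(filename_out)
    plt.close()


# Snakemake injects a `snakemake` object when the file is run through a
# `script:` directive. The guard (an assumption added for this sketch)
# keeps the module usable when that object is absent.
if "snakemake" in globals():
    make_graph(snakemake.input[0], snakemake.output[0])
```

If you would rather not edit the script, setting the `MPLBACKEND` environment variable to `Agg` before calling snakemake should select the same backend, though I have not timed that variant on WSL.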