
Snakemake takes 90 s on WSL for test snakefile with very simple workload


Context

I'm exploring snakemake to define my data analysis as a DAG and evaluate it reproducibly. I installed it in a separate environment on WSL2 on Windows 10. Before working on my actual project, I wanted to try it on a simple project to get a feel for how it works.

I wanted to test using the shell and Python on 3 txt files (2 columns of numbers, comma separated). A first rule copies the files as .csv to a separate folder 'intermediate' through a shell command. A second rule then runs a Python script that reads the intermediate file and graphs it using matplotlib. I want to run the Python script in a separate conda environment. This is not strictly necessary for what I want to do, but I can see the benefits of it.

Question

The DAG runs properly and outputs the files. However, the rule involving conda/python takes 90 seconds. This seems unnecessarily long; from the command line I would expect this to run in a second or so. Am I doing something wrong? Is something happening in the background that I'm not aware of? Even if the delay is necessary, it would be easier to accept if I knew what was happening.

What I've tried so far

  • Moved the project to the WSL disk: at first the project was running on the mounted C: drive, but that turns out to be known for speed issues. So the project now lives in a subdirectory of my home directory.
  • An env with numpy and matplotlib was created in .snakemake/conda/alphanumeric_env_name the first time I ran the snakefile. This took a while (a little longer than creating the env manually with conda/mamba from the yml, as documented here), but the DAG worked with --use-conda. When running, the Python rules still took a long time, and the last thing bash printed was Activating conda environment: ../.snakemake/conda/a060898bb3a415a46236eba6c4b6b5fa_ so I figured it was the activation that was slow.
  • To check this, I installed numpy and pandas with conda in my snakemake env itself and commented out the conda part of the rule. Using the snakemake environment itself is less ideal, but I could live with it. Execution was still very slow (see log below).

Code

My snakefile looks as follows.

workflow/snakefile

samples = "first_data,second_data,third_data"

rule all:
    input:
        expand("graphs/{file}.png", file=samples.split(",")),


rule make_intermediate:
    input:
        "data/{file}.txt",
    output:
        "intermediate/{file}_shell.csv",
    shell:
        "cp {input[0]} {output[0]}"


rule make_graph:
    input:
        "intermediate/{file}_shell.csv",
    output:
        "graphs/{file}.png",
    # conda:
    #    "../envs/data_env.yaml"
    script:
        "../scripts/a_script.py" 

with a_script.py

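import matplotlib.pyplot as plt
import numpy as np


def make_graph(filename_in, filename_out):
    # Read the two comma-separated columns and plot the second against the first.
    data = np.loadtxt(filename_in, delimiter=',')
    plt.figure()
    plt.title(filename_in)
    plt.plot(data[:, 0], data[:, 1])
    plt.savefig(filename_out)
    plt.close()


# The `snakemake` object is injected by Snakemake when run via `script:`.
make_graph(snakemake.input[0], snakemake.output[0])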

This is the relevant log from running snakemake -c 4 (I do not use conda to open a separate environment here).

Select jobs to execute...
Execute 3 jobs...

[Wed Jan 17 19:58:26 2024]
localrule make_graph:
    input: intermediate/second_data_shell.csv
    output: graphs/second_data.png
    jobid: 3
    reason: Missing output files: graphs/second_data.png
    wildcards: file=second_data
    resources: tmpdir=/tmp

[Wed Jan 17 19:58:26 2024]
localrule make_graph:
    input: intermediate/first_data_shell.csv
    output: graphs/first_data.png
    jobid: 1
    reason: Missing output files: graphs/first_data.png
    wildcards: file=first_data
    resources: tmpdir=/tmp

[Wed Jan 17 19:58:26 2024]
localrule make_graph:
    input: intermediate/third_data_shell.csv
    output: graphs/third_data.png
    jobid: 5
    reason: Missing output files: graphs/third_data.png
    wildcards: file=third_data
    resources: tmpdir=/tmp

[Wed Jan 17 20:00:39 2024]
Finished job 1.
1 of 4 steps (25%) done
[Wed Jan 17 20:00:39 2024]
Finished job 3.
2 of 4 steps (50%) done
[Wed Jan 17 20:00:39 2024]
Finished job 5.
3 of 4 steps (75%) done
Select jobs to execute...
Execute 1 jobs...

Thank you!


Solution

  • It turns out the problem was in an unexpected place: the Python script ran fast enough when I only asked it to print, so the delay came from the plotting. The cause was that there is no graphical backend in WSL. Adding the following at the top of a_script.py, which selects matplotlib's non-interactive Agg backend, solved the problem.

    import matplotlib
    matplotlib.use('Agg')
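
For completeness, here is a sketch of how the whole of a_script.py might look with the fix applied. The function body is the one from the question. The `if "snakemake" in globals()` guard is my own addition, not something Snakemake requires. It only lets the file be imported or run by hand for testing.

```python
import matplotlib

# Select the non-interactive Agg backend before pyplot is imported,
# so matplotlib does not go looking for a GUI toolkit or display.
matplotlib.use("Agg")

import matplotlib.pyplot as plt
import numpy as np


def make_graph(filename_in, filename_out):
    """Plot the second column against the first of a two-column CSV file."""
    data = np.loadtxt(filename_in, delimiter=",")
    plt.figure()
    plt.title(filename_in)
    plt.plot(data[:, 0], data[:, 1])
    plt.savefig(filename_out)
    plt.close()


# Snakemake injects a `snakemake` object when the file is run via `script:`.
# The guard below is an assumption added for illustration, so that the
# module can also be imported outside of Snakemake.
if "snakemake" in globals():
    make_graph(snakemake.input[0], snakemake.output[0])
```

If you would rather not touch the script, setting the MPLBACKEND environment variable to Agg before running snakemake should select the same backend, though I have only tried the in-script version.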