conda, snakemake, reproducible-research

How to trace back the exact software version(s) used to generate result files in a snakemake workflow


Say I'm following the best-practice workflow structure suggested for snakemake. Now I'd like to know how (i.e. with which software versions) a given file, say plots/myplot.pdf, was generated. I found this surprisingly hard, if not impossible, with only the result folder at hand.

In more detail, say I generated the results using snakemake --use-conda --conda-prefix ~/.conda/myenvs, which resolves and downloads the conda environments specified in rules such as the one below (copied from the documentation):

rule NAME:
    input:
        "table.txt"
    output:
        "plots/myplot.pdf"
    conda:
        "envs/ggplot.yaml"
    script:
        "scripts/plot-stuff.R"

Say the content of envs/ggplot.yaml is the following:

channels:
  - conda-forge
dependencies:
  - r-ggplot2

After completion, the ggplot environment will have been saved under, say, ~/.conda/myenvs/d2d1d57b (note that the environment name d2d1d57b is assigned automatically by snakemake).

The problem is that if I ship the workflow subfolder, e.g. as the result to someone else (or as a supplement to a paper), I don't know which ggplot version was used for that run. All I know is the content of the YAML file (which is also included when using --report). Also, since ggplot depends on other software, for instance R, I wouldn't know which R version was used for a given rule using this environment, since the YAML file doesn't list indirect dependencies.

Ideally, I'd like to have the complete list of software versions in the environment shipped with the workflow results. As a workaround, one could run conda env export -n name_of_env and copy the output into the result folder, but strangely conda list -n ~/.conda/myenvs/d2d1d57b does not work (it fails with the error Characters not allowed: ('/', ' ', ':', '#')).
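The "Characters not allowed" error is likely because -n/--name expects a bare environment name, while the environments snakemake creates are addressed by their directory. Both conda list and conda env export also accept -p/--prefix, which takes a path. Below is a minimal Python sketch of exporting such an environment by path; the function names and the output file name are made up for illustration, and it assumes conda is on the PATH.

```python
import os
import subprocess


def conda_export_command(env_path):
    """Build a `conda env export` command that addresses an environment by path."""
    # --prefix takes a directory, whereas --name only accepts a bare name.
    # "~" is expanded here because subprocess does not go through a shell.
    return ["conda", "env", "export", "--prefix", os.path.expanduser(env_path)]


def export_environment(env_path, out_file):
    """Write the fully resolved package list of the environment to out_file."""
    result = subprocess.run(
        conda_export_command(env_path),
        check=True,            # raise if conda exits with an error
        capture_output=True,
        text=True,
    )
    with open(out_file, "w") as handle:
        handle.write(result.stdout)
    return out_file
```

A call such as export_environment("~/.conda/myenvs/d2d1d57b", "ggplot.resolved.yaml") would then place the resolved environment, including indirect dependencies such as r-base, in a file that can be copied into the result folder.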

Creating an environment manually and inspecting it indeed gives me (among other info):

r-base                    4.0.2                he766273_1    conda-forge
r-ggplot2                 3.3.2             r40h6115d3f_0    conda-forge

That's exactly what I'm after, but this of course would be too tedious manually.
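To make that less tedious, the text printed by conda list could be turned into a small lookup table. The following is only a sketch: parse_conda_list is a hypothetical helper, and it assumes the usual whitespace-separated columns (name, version, build, channel) with header lines starting with "#".

```python
def parse_conda_list(text):
    """Map package name -> version/build/channel from `conda list` text output."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "#"-prefixed header lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # not a package row
        packages[fields[0]] = {
            "version": fields[1],
            "build": fields[2] if len(fields) > 2 else "",
            "channel": fields[3] if len(fields) > 3 else "",
        }
    return packages


listing = """\
r-base                    4.0.2                he766273_1    conda-forge
r-ggplot2                 3.3.2             r40h6115d3f_0    conda-forge
"""
versions = parse_conda_list(listing)
print(versions["r-ggplot2"]["version"])
```

With the two rows shown above as input, the lookup gives the r-ggplot2 and r-base versions directly, which is the information missing from the YAML file.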

This is also true when using wrappers as far as I can tell.

In summary, given a workflow, or even a single file within the workflow, how can I trace back which exact software version(s) were used to generate it? Ideally, this information would be shipped automatically with the results of a workflow by default.

Maybe I'm even missing something very obvious, so hopefully someone can shed some light on this.

Update: an issue was submitted.


Solution

  • Based on our discussion in the comments, you could write the environment export to a log file:

    rule NAME:
        input:
            "table.txt"
        output:
            "plots/myplot.pdf"
        log:
            "mylog.txt"
        conda:
            "envs/ggplot.yaml"
        shell:
            """
            conda env export > {log} 
            yourcode
            """
    

    However, as you indicate, this won't work if people do not use --use-conda, and it is tedious to add this to each rule, so you could try something like this (not tested, might not work):

    if workflow.use_conda:
        shell.prefix("set -o pipefail; conda env export > {log}; ")
    

    This would add the export to each shell command.

    If you use scripts, I am less sure how to continue. The "easiest" option might be to call conda env export in a shell command from inside the Python/R script.

    Edit: the shell.prefix trick does not seem to work, so I struck through that text.
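For the script case mentioned above, a rough Python sketch of calling conda env export from inside the script could look like the following. The helper names are made up for illustration, and it assumes that conda is on the PATH and that the rule's environment is the active one in the script's process; I have not verified this against every snakemake version.

```python
import subprocess


def capture_environment(run=subprocess.run):
    """Return the `conda env export` text for the currently active environment."""
    result = run(
        ["conda", "env", "export"],
        check=True,            # raise if conda exits with an error
        capture_output=True,
        text=True,
    )
    return result.stdout


def record_environment(log_path, run=subprocess.run):
    """Write the exported environment to log_path and return the path."""
    with open(log_path, "w") as handle:
        handle.write(capture_environment(run=run))
    return log_path


# Inside a file run through the `script:` directive, snakemake injects a
# `snakemake` object, so the call could look like this:
# record_environment(snakemake.log[0])
```

The run parameter is only there so the subprocess call can be swapped out, e.g. for testing on a machine without conda.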