
Best practices for (bio)conda versions in Snakemake wrappers?


What are the best practices for specifying packages in the environment.yml of a Snakemake wrapper that uses conda? I understand that the channels should be:

channels:
  - conda-forge
  - bioconda
  - base

However, what is a good choice for specifying packages? Do I specify no version? Full versions?

Using full versions has previously led to conda environment resolution that takes extremely long or never finishes. However, not pinning versions carries the risk of implicitly upgrading to an incompatible version of a package.

Do I specify only direct dependencies, or should I put the output of conda env export there so that everything is frozen?


Solution

  • Edit: The recommendations below are now integrated into the contribution documentation for snakemake wrappers. In the future, they will be updated there:

    https://snakemake-wrappers.readthedocs.io/en/stable/contributing.html#environment-yaml-file

    For package version numbers, I would usually pin the major and minor version. This way, users get the newest security patches and bug fixes whenever they create an environment, while nothing should change in a backward-incompatible way (wherever developers properly follow semantic versioning). Also, an additional pinning file for package versions has recently been added as a requirement for wrappers. You can generate it with:

    snakedeploy pin-conda-envs environment.yaml
    

    Also, I would specify only direct dependencies and let the environment solver (and snakedeploy pin-conda-envs) handle any implicit dependencies. This leaves the solver some freedom to meet the different needs of different packages, and the packages' recipes should usually specify any restrictions to particular versions.
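
    As a minimal sketch of the two recommendations above (pin major and minor version, list only direct dependencies), an environment.yaml could look like the following. The tools and version numbers are illustrative assumptions and are not taken from an actual wrapper:

    ```yaml
    channels:
      - conda-forge
      - bioconda
      - nodefaults
    dependencies:
      # Illustrative direct dependencies only; implicit ones are left to the solver.
      # "=1.19" matches 1.19 and any 1.19.x patch release, so bug fixes are picked up.
      - samtools =1.19
      - bcftools =1.19
    ```

    Running the snakedeploy command shown above on such a file should then record the exact versions the solver chose, so the loose specification and the frozen pin file live side by side.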

    Another way to avoid (future) conflicts and keep environment creation quick is to keep environments as small and granular as possible (see Johannes' comment below). If different rules share only some dependencies but not others, I would rather create a separate minimal environment for each rule than reuse a bigger environment. Snakemake wrappers do this anyway, as each wrapper has its own environment definition.

    As Johannes pointed out, the same applies to channels: only specify channels that you actually use. It is no longer necessary to specify the base channel. Quite the opposite, we suggest the following channels in this order (skip those that you do not need!):

    channels:
      - conda-forge
      - bioconda
      - nodefaults
    

    Nowadays, conda-forge should have packages for any necessary dependencies, and nodefaults avoids confusion and dependency resolution conflicts with the defaults or base channels.

    Also note that Snakemake nowadays uses mamba as the default tool for (conda) environment solving. It is usually much faster than conda and is better at ensuring that you get the most up-to-date versions of packages.

    Of course, the right choice always depends on the situation. If you know of version incompatibilities that are not handled by the packages' recipes, it can be necessary to specify and pin implicit dependencies. If your software creates output that can change with a patch version, then you have to pin the patch version as well.
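
    A sketch of those two exceptions, again with illustrative package names and versions (the incompatibility hinted at in the comments is hypothetical):

    ```yaml
    dependencies:
      # Patch version pinned because the output may differ between patch releases.
      - samtools =1.19.2
      # Implicit dependency listed and pinned explicitly, assuming a known
      # incompatibility that the recipe does not capture (hypothetical case).
      - htslib =1.19
    ```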