Tags: python, conda, sun-grid-engine, snakemake

Snakemake rule with Python script, conda and cluster


I would like to have Snakemake run a Python script with a specific conda environment on an SGE cluster.

On the cluster I have miniconda installed in my home directory. My home directory is mounted via NFS, so it is accessible to all cluster nodes.

Because miniconda is in my home directory, the conda command is not on the operating system's PATH by default, i.e., to use conda I first need to add it to the PATH explicitly.

I have a conda environment specification as a YAML file, which could be used with the --use-conda option. Will this also work with the --cluster "qsub" option?

FWIW, I also launch Snakemake from a conda environment (in fact, the same environment I want the script to run in).


Solution

  • I have an existing Snakemake system, running conda, on an SGE cluster. It's delightful and very doable. I'll try to offer perspective and guidance.

    The location of your miniconda, local or shared, may not matter. If you use a login shell to access your cluster, you can set your default environment variables upon logging in, and this will have a global effect. If possible, I highly suggest editing your .bashrc to accomplish this. It will properly, and automatically, set up your conda PATH upon login.

    One of the lines in my file, "/home/tboyarski/.bashrc":

     export PATH=$HOME/share/usr/anaconda/4.3.0/bin:$PATH
    

    EDIT 1: Good point made in a comment.

    Personally, I consider it good practice to put everything under conda control; however, this may not be ideal for users who commonly require access to software not supported by conda. Typically, support issues have to do with old operating systems (e.g., CentOS 5 support was recently dropped). As suggested in the comment, manually exporting the PATH variable in a single terminal session may be preferable for users who do not work on pipelines exclusively, as this will not have a global effect.

    With that said, and like myself, prior to Snakemake execution I recommend initializing the conda environment used by the majority, or the entirety, of your pipeline. I find this the preferable way, as it allows conda to create the environment directly, instead of having Snakemake ask conda to create it. I don't have the link for the web discussion, but I believe I read somewhere that individuals who relied solely on Snakemake to create the environments, rather than launching from a base environment, found that the environments were being stored in the .snakemake directory and that it was getting excessively large. Feel free to look for the post. The issue was addressed by the author, who reduced the load on the hidden folder, but still, I think it makes more sense to launch the jobs from an existing Snakemake environment, which interacts with your head node and then passes the corresponding environment variables to its child nodes. I like a bit of hierarchy.
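    As a rough sketch, a launch along those lines could look like this (the environment name "pipeline" is illustrative, not from my actual setup):

     source activate pipeline   # or "conda activate pipeline" on newer conda
     snakemake --jobs 50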

    With that said, you will likely need to pass the environment variables to your child nodes if you are running Snakemake from your head node's environment and letting Snakemake interact with the SGE job scheduler via qsub. I actually use the built-in DRMAA support, which I highly recommend. Both submission methods require me to provide the following arguments:

       -V     Available for qsub, qsh, qrsh with command and qalter.
    
             Specifies that all environment variables active within the qsub
             utility be exported to the context of the job.
    

    Also...

      -S [[hostname]:]pathname,...
             Available for qsub, qsh and qalter.
    
             Specifies the interpreting shell for the job. pathname must be
             an executable file which interprets command-line options -c and
             -s as /bin/sh does.
    

    To give you a better starting point, I also specify virtual memory and core counts; this might be specific to my SGE system, I do not know.

    -V -S /bin/bash -l h_vmem=10G -pe ncpus 1
    

    I fully expect you'll require both arguments when submitting to the SGE cluster, as I do personally. I recommend putting your cluster submission variables in JSON format, in a separate file. The code snippet above can be found in this example of what I've done personally. I've organized it slightly differently than in the tutorial, but that's because I needed a bit more granularity.
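    To illustrate how the pieces fit together, here is a sketch, assuming a cluster.json next to the Snakefile containing an entry like {"__default__": {"h_vmem": "10G", "ncpus": 1}} (file name and values are illustrative, not my production settings). Snakemake fills the {cluster.*} placeholders from that file on a per-rule basis:

     # Plain qsub submission:
     snakemake --jobs 50 --cluster-config cluster.json \
         --cluster "qsub -V -S /bin/bash -l h_vmem={cluster.h_vmem} -pe ncpus {cluster.ncpus}"

     # Or via the built-in DRMAA support (note the leading space in the string):
     snakemake --jobs 50 --cluster-config cluster.json \
         --drmaa " -V -S /bin/bash -l h_vmem={cluster.h_vmem} -pe ncpus {cluster.ncpus}"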

    Personally, I only use the --use-conda option when running a conda environment different from the one I used to launch and submit my Snakemake jobs. For example, my main conda environment runs Python 3, but if I need a tool that, say, requires Python 2, then and only then will I use Snakemake to launch that rule with the specific environment, such that the execution of that rule uses a path corresponding to a Python 2 installation. This was of huge importance to my employer, as the existing system I was replacing struggled to seamlessly switch between Python 2 and 3; with conda and Snakemake, this is very easy.
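    As a minimal sketch, such a rule could look like the following in the Snakefile (the rule name, file paths, and environment file are illustrative, and python2_tool.py is a hypothetical script); running with --use-conda then builds and activates the Python 2 environment for this rule only:

     # envs/py2.yaml would pin the interpreter, e.g. "dependencies: [python=2.7]".
     rule legacy_step:
         input:
             "data/sample.txt"
         output:
             "results/sample.out"
         conda:
             "envs/py2.yaml"   # illustrative path to the env spec
         shell:
             "python python2_tool.py {input} > {output}"   # hypothetical script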

    In principle, I think it is good practice to launch a base conda environment and to run Snakemake from there. It encourages the use of a single environment for the entire run. Keep it simple, right? Complicate things only when necessary, like when needing to run both Python 2 and Python 3 in the same pipeline. :)