Snakemake: Best practices in separating codebase from raw data and results?


I'm new to Snakemake, but I've made a pipeline that's very helpful to my group. So far I've run it on two batches of data. To do so, I created two separate clones of the following directory structure, one for each batch.

├── .gitignore
├── README.md
├── LICENSE.md
├── config.yaml
├── Snakefile
├── ANNOTATIONS/
├── DATA/
│   ├── DATA_TYPE1/
│   ├── DATA_TYPE2/
│   └── DATA_TYPE3/
└── RESULTS/

This requires maintaining duplicate Snakefiles and auxiliary scripts (stored in SCRIPTS/), which causes problems when tweaks are needed, since I have to remember to propagate each change to the version-controlled directory. It also requires duplicating the storage-intensive ANNOTATIONS/ directory, which is the larger problem.

I know there has to be a better way to do this, as this would not be scalable for a large number of batches or projects. For project management reasons I want to keep these batches separate.

My thought is that I can keep the code (Snakefile and SCRIPTS/) and annotations (stored in ANNOTATIONS/) in one directory and keep the data and results in a separate project directory, setting workdir to this project directory either in the config file or when calling snakemake on the command line. The caveat is that any reference to the ANNOTATIONS folder in the Snakefile or config file would need to be a full path to the Snakemake workflow directory instead of a path relative to the working directory. That brings a new challenge: the workflow can never be moved without fixing this full path wherever it is used.

I looked for guidance in https://snakemake.readthedocs.io/en/latest/snakefiles/deployment.html but it didn't seem to cover the advice I'm looking for.

How do others handle making their Snakemake workflow w/ necessary auxiliary data a separate maintainable directory, distinct from their input and output data? I think my initial thought will work (setting workdir while setting full path for ANNOTATIONS/ data) though it still feels a bit messy.


Solution

  • I use the --directory/-d CLI option. I typically set up my directories like so:

    .
    ├── my-snakemake-workflow/
    │   ├── .git
    │   ├── config/
    │   │   └── config.yaml
    │   └── workflow/
    │       ├── envs
    │       ├── rules
    │       ├── scripts
    │       └── Snakefile
    └── my-project/
        ├── my-data/
        │   ├── reads
        │   └── genome
        └── config/
            ├── config.yaml
            └── samples.csv
    

    Then to run the workflow my command line would look like: snakemake -s my-snakemake-workflow/workflow/Snakefile -d my-project/ <other options...>

    This keeps data and results out of the workflow git repo and keeps things more organized.
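
For the ANNOTATIONS/ concern raised in the question, one possible approach is to build annotation paths from the directory that holds the Snakefile rather than hard-coding a full path. The following is a minimal sketch in plain Python, not part of the original answer. It assumes the annotations live in an ANNOTATIONS/ folder next to the Snakefile; the helper name `annotation_path` and the file name `genes.gtf` are hypothetical. Inside a Snakefile, the first argument would typically be `workflow.basedir`, which Snakemake sets to the directory containing the Snakefile.

```python
from pathlib import Path


def annotation_path(workflow_dir, file_name, annotations_dirname="ANNOTATIONS"):
    """Return the path of an annotation file stored alongside the workflow code.

    workflow_dir: directory containing the Snakefile. Assumption: inside a
        Snakefile this would be passed as `workflow.basedir`.
    file_name: hypothetical file name inside the annotations folder.
    annotations_dirname: folder name, assumed to sit next to the Snakefile.
    """
    # Joining onto the workflow directory keeps the result independent of
    # whatever working directory is selected with --directory/-d.
    return Path(workflow_dir) / annotations_dirname / file_name


# Plain-Python usage with a made-up location for the workflow checkout.
gtf = annotation_path("/opt/pipelines/my-snakemake-workflow/workflow", "genes.gtf")
print(gtf)
```

Because the path is derived from the workflow's own location, moving the workflow directory should not require editing any paths, while inputs and outputs stay relative to the project directory given with -d.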