
Symlink (auto-generated) directories via Snakemake


I am trying to create a symlink-directory structure for aliasing output directories in a Snakemake workflow.

Let's consider the following example:

A long time ago in a galaxy far, far away, somebody wanted to find the best ice cream flavour in the universe and conducted a survey. Our example workflow aims at representing the votes by a directory structure. The survey was conducted in English (because that's what they all speak in that foreign galaxy), but the results should be understood by non-English speakers as well. Symbolic links come to the rescue.

To make the input parsable for us humans as well as for Snakemake, we put it into a YAML file:

cat config.yaml
flavours:
  chocolate:
    - vader
    - luke
    - han
  vanilla:
    - yoda
    - leia
  berry:
    - windu
translations:
  french:
    chocolat: chocolate
    vanille: vanilla
    baie: berry
  german:
    schokolade: chocolate
    vanille: vanilla
    beere: berry

To create the corresponding directory tree, I started with this simple Snakefile:

### Setup ###

configfile: "config.yaml"


### Targets ###

votes = ["english/" + flavour + "/" + voter
         for flavour, voters in config["flavours"].items()
         for voter in voters]

translations = {language + "_translation/" + translation
                for language, translations in config["translations"].items()
                for translation in translations.keys()}


### Commands ###

create_file_cmd = "touch '{output}'"

relative_symlink_cmd = "ln --symbolic --relative '{input}' '{output}'"


### Rules ###

rule all:
    input: votes, translations

rule english:
    output: "english/{flavour}/{voter}"
    shell: create_file_cmd

rule translation:
    input: lambda wc: "english/" + config["translations"][wc.lang][wc.trans]
    output: "{lang}_translation/{trans}"
    shell: relative_symlink_cmd

I am sure there are more 'pythonic' ways to achieve what I want, but this is just a quick example to illustrate my problem.
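
To make the targets concrete, here is a minimal plain-Python sketch of what the two comprehensions and the input lambda evaluate to. The `config` dict is a hand-written stand-in for what Snakemake loads from config.yaml, and `translation_input` is a hypothetical helper name that mirrors the lambda in rule translation; neither is part of the Snakefile above.

```python
# Hand-written stand-in for the dict Snakemake loads from config.yaml.
config = {
    "flavours": {
        "chocolate": ["vader", "luke", "han"],
        "vanilla": ["yoda", "leia"],
        "berry": ["windu"],
    },
    "translations": {
        "french": {"chocolat": "chocolate", "vanille": "vanilla", "baie": "berry"},
        "german": {"schokolade": "chocolate", "vanille": "vanilla", "beere": "berry"},
    },
}

# One file target per (flavour, voter) pair.
votes = [
    f"english/{flavour}/{voter}"
    for flavour, voters in config["flavours"].items()
    for voter in voters
]

# One symlink target per (language, translated flavour) pair.
translations = {
    f"{language}_translation/{translation}"
    for language, mapping in config["translations"].items()
    for translation in mapping
}


def translation_input(lang, trans):
    """Hypothetical helper mirroring the input lambda of rule translation."""
    return "english/" + config["translations"][lang][trans]


print(sorted(votes))
print(sorted(translations))
print(translation_input("french", "vanille"))
```

Note that the input of a translation is a directory such as english/vanilla, while rule english only declares files below that directory as outputs.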

Running the above workflow with snakemake, I get the following error:

Building DAG of jobs...
MissingInputException in line 33 of /tmp/snakemake.test/Snakefile
Missing input files for rule translation:
english/vanilla

So while Snakemake is clever enough to create the english/<flavour> directories when attempting to make an english/<flavour>/<voter> file, it seems to 'forget' about the existence of this directory when using it as an input to make a <language>_translation/<flavour> symlink.
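
A rough way to see why the directory is reported as missing is to match the requested input against the output pattern of rule english. The sketch below is not Snakemake's actual implementation; `pattern_to_regex` is a hypothetical helper, and it assumes that each wildcard matches the regular expression `.+` unless constrained.

```python
import re


def pattern_to_regex(pattern):
    """Hypothetical helper: turn 'a/{x}/{y}' into a compiled regex.

    Assumption: every wildcard matches '.+' (no wildcard constraints).
    """
    # re.split with a capturing group alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        regex += re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
    return re.compile(regex + "$")


english_output = pattern_to_regex("english/{flavour}/{voter}")

# A voter file is something rule english can produce ...
print(bool(english_output.match("english/vanilla/yoda")))
# ... but the bare flavour directory matches no declared output.
print(bool(english_output.match("english/vanilla")))
```

Under these assumptions no rule claims english/vanilla as an output, so it has to exist on disk already or it counts as a missing input.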

As an intermediate step, I applied the following patch to the Snakefile:

27c27
<     input: votes, translations
---
>     input: votes#, translations

Now, the workflow ran through and created the english directory as expected (snakemake -q output only):

Job counts:
        count   jobs
        1       all
        6       english
        7

Now with the target directories created, I went back to the initial version of the Snakefile and re-ran it:

Job counts:
        count   jobs
        1       all
        6       translation
        7
ImproperOutputException in line 33 of /tmp/snakemake.test/Snakefile
Outputs of incorrect type (directories when expecting files or vice versa). Output directories must be flagged with directory(). for rule translation:
french_translation/chocolat
Exiting because a job execution failed. Look above for error message

While I am not sure whether a symlink to a directory qualifies as a directory, I went ahead and applied a new patch to follow the suggestion:

35c35
<     output: "{lang}_translation/{trans}"
---
>     output: directory("{lang}_translation/{trans}")
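
On the question of whether a symlink to a directory counts as a directory: at the level of Python's standard library it is both a link and a directory, because `os.path.isdir` follows symlinks. The sketch below only demonstrates that behaviour in a temporary directory; whether Snakemake's output type check relies on the same calls is an assumption I have not verified.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "english", "vanilla")
    link = os.path.join(tmp, "french_translation", "vanille")
    os.makedirs(target)
    os.makedirs(os.path.dirname(link))

    # Create a relative symlink pointing at the target directory.
    os.symlink(os.path.relpath(target, start=os.path.dirname(link)), link)

    is_link = os.path.islink(link)  # does not follow the link
    is_dir = os.path.isdir(link)    # follows the link to its target

print("islink:", is_link)
print("isdir:", is_dir)
```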

With that, snakemake finally created the symlinks:

Job counts:
        count   jobs
        1       all
        6       translation
        7

As a confirmation, here is the resulting directory structure:

english
├── berry
│   └── windu
├── chocolate
│   ├── han
│   ├── luke
│   └── vader
└── vanilla
    ├── leia
    └── yoda
french_translation
├── baie -> ../english/berry
├── chocolat -> ../english/chocolate
└── vanille -> ../english/vanilla
german_translation
├── beere -> ../english/berry
├── schokolade -> ../english/chocolate
└── vanille -> ../english/vanilla

9 directories, 6 files
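
The link targets shown in the tree are what `ln --symbolic --relative` computes: the path of the source relative to the directory that contains the link. A small sketch of the same computation with the standard library, assuming POSIX path separators (`relative_link_target` is a hypothetical helper name):

```python
import os


def relative_link_target(source, link):
    """Hypothetical helper: path of `source` relative to the directory of `link`."""
    return os.path.relpath(source, start=os.path.dirname(link) or ".")


print(relative_link_target("english/berry", "french_translation/baie"))
print(relative_link_target("english/chocolate", "german_translation/schokolade"))
```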

However, besides not being able to create this structure without running snakemake twice (and modifying the targets in between), even simply re-running the workflow results in an error:

Building DAG of jobs...
ChildIOException:
File/directory is a child to another output:
/tmp/snakemake.test/english/berry
/tmp/snakemake.test/english/berry/windu

Snakemake also ends up running the translation rules again for no (good) reason:

Job counts:
        count   jobs
        1       all
        5       translation
        6

So my question is: How can I implement the above logic in a working Snakefile?

Note that I am not looking for advice to change the data representation in the YAML file and/or the Snakefile. This is just an example to highlight (and isolate) an issue I encountered in a more complex scenario.

Sadly, while I could not figure this out by myself so far, I managed to get a working GNU make version (even though the 'YAML parsing' is hackish at best):

### Setup ###

configfile := config.yaml


### Targets ###

votes := $(shell awk ' \
  NR == 1 { next } \
  /^[^ ]/ { exit } \
  NF == 1 { sub(":", "", $$1); dir = "english/" $$1 "/"; next } \
  { print dir $$2 } \
  ' '$(configfile)')

translations := $(shell awk ' \
  NR == 1 { next } \
  /^[^ ]/ { trans = 1; next } \
  ! trans { next } \
  { sub(":", "", $$1) } \
  NF == 1 { dir = $$1 "_translation/"; next } \
  { print dir $$1 } \
  ' '$(configfile)')


### Commands ###

create_file_cmd = touch '$@'

create_dir_cmd = mkdir --parent '$@'

relative_symlink_cmd = ln --symbolic --relative '$<' '$@'


### Rules ###

all : $(votes) $(translations)

$(sort $(dir $(votes) $(translations))) : % :
    $(create_dir_cmd)
$(foreach vote, $(votes), $(eval $(vote) : | $(dir $(vote))))
$(votes) : % :
    $(create_file_cmd)

translation_targets := $(shell awk ' \
  NR == 1 { next } \
  /^[^ ]/ { trans = 1; next } \
  ! trans { next } \
  NF != 1 { print "english/" $$2 "/"} \
  ' '$(configfile)')
define translation
$(word $(1), $(translations)) : $(word $(1), $(translation_targets)) | $(dir $(word $(1), $(translations)))
    $$(relative_symlink_cmd)
endef
$(foreach i, $(shell seq 1 $(words $(translations))), $(eval $(call translation, $(i))))

Running make on this works just fine:

mkdir --parent 'english/chocolate/'
touch 'english/chocolate/vader'
touch 'english/chocolate/luke'
touch 'english/chocolate/han'
mkdir --parent 'english/vanilla/'
touch 'english/vanilla/yoda'
touch 'english/vanilla/leia'
mkdir --parent 'english/berry/'
touch 'english/berry/windu'
mkdir --parent 'french_translation/'
ln --symbolic --relative 'english/chocolate/' 'french_translation/chocolat'
ln --symbolic --relative 'english/vanilla/' 'french_translation/vanille'
ln --symbolic --relative 'english/berry/' 'french_translation/baie'
mkdir --parent 'german_translation/'
ln --symbolic --relative 'english/chocolate/' 'german_translation/schokolade'
ln --symbolic --relative 'english/vanilla/' 'german_translation/vanille'
ln --symbolic --relative 'english/berry/' 'german_translation/beere'

The resulting tree is identical to the one shown above.

Also, running make again works as well:

make: Nothing to be done for 'all'.

So I really hope the solution is not to go back to old-fashioned GNU make with all the unreadable hacks I have internalized over the years, but that there is a way to convince Snakemake to do what I spelled out as well. ;-)

Just in case it is relevant: This was tested using Snakemake version 5.7.132.2.


Solution

  • I wanted to test with a newer version of Snakemake (5.20.1), and I came up with something similar to the answer proposed by Manalavan Gajapathy:

    ### Setup ###
    
    configfile: "config.yaml"
    
    VOTERS = list({voter for flavour in config["flavours"].keys() for voter in config["flavours"][flavour]})
    
    ### Targets ###
    
    votes = ["english/" + flavour + "/" + voter
             for flavour, voters in config["flavours"].items()
             for voter in voters]
    
    translations = {language + "_translation/" + translation
                    for language, translations in config["translations"].items()
                    for translation in translations.keys()}
    
    
    ### Commands ###
    
    create_file_cmd = "touch '{output}'"
    
    relative_symlink_cmd = "ln --symbolic --relative $(dirname '{input}') '{output}'"
    
    
    ### Rules ###
    
    rule all:
        input: votes, translations
    
    rule english:
        output: "english/{flavour}/{voter}"
        # To avoid considering ".done" as a voter
        wildcard_constraints:
            voter="|".join(VOTERS),
        shell: create_file_cmd
    
    def get_voters(wildcards):
        return [f"english/{wildcards.flavour}/{voter}" for voter in config["flavours"][wildcards.flavour]]
    
    rule flavour:
        input: get_voters
        output: "english/{flavour}/.done"
        shell: create_file_cmd
    
    rule translation:
        input: lambda wc: "english/" + config["translations"][wc.lang][wc.trans] + "/.done"
        output: directory("{lang}_translation/{trans}")
        shell: relative_symlink_cmd
    

    This runs and creates the desired output, but fails with ChildIOException when re-run (even though there is nothing left to do).
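
The effect of the wildcard constraint in rule english can be checked in plain Python. This is only an approximation that assumes the constraint has to match the whole wildcard value; `matches_voter` is a hypothetical helper, and the `flavours` dict is copied by hand from config.yaml.

```python
import re

# Copied by hand from the flavours section of config.yaml.
flavours = {
    "chocolate": ["vader", "luke", "han"],
    "vanilla": ["yoda", "leia"],
    "berry": ["windu"],
}

# Sorted only to make the printed regex deterministic.
VOTERS = sorted({voter for voters in flavours.values() for voter in voters})
constraint = "|".join(VOTERS)


def matches_voter(name):
    """Hypothetical helper: True if `name` is one of the known voters."""
    return re.fullmatch(constraint, name) is not None


print(constraint)
print(matches_voter("yoda"))   # a real voter
print(matches_voter(".done"))  # the marker file of rule flavour
```

With the constraint in place, english/{flavour}/.done can only be produced by rule flavour and not by rule english.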