python conda snakemake virtual-environment

Run command when creating Conda environment with Snakemake

I am using snakemake to run a workflow. One of my rules uses the following yaml file to create the Conda env:

name: macs2
channels: 
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - macs2=2.2.9.1
  - homer=4.11

For the homer package I need to install a data base, which I have included in the first line of the shell command of this rule:

rule annotate_peaks:
        input:
            "peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks.bed"
        output:
            "peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks_annotated.txt"
        params:
            genome=config["general"]["genome"],
            regions=config["chip-seq"]["macs2"]["regions"],
        threads: config["resources"]["macs2"]["cpu"]
        resources: 
            runtime=config["resources"]["macs2"]["time"]
        conda:
            "envs/macs2.yaml"
        shell:
            "perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {params.genome}; "
            "annotatePeaks.pl {input} {params.genome} > {output}"

The problem with this approach is that the first shell command will download the data for the chosen data base which are very large files and it takes a while to complete.

For each iteration of the rule this information will be downloaded again, which will slow down the analysis.

Is there a way to run perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {params.genome} when snakemake creates the "macs2" Conda env so that this only has to be done once?

Solution

I presume that downloading script produces some output. I would split your rule into two rules, one specifically for downloading and the other only for running the annotation.

The downloading rule will have declared only the output directive, which would be the input for the second rule.

rule download_genome:
        output:
            directory("some_path/{genome}"), # the result of downloading
        conda:
            "envs/macs2.yaml"
        shell:
            "perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {wildcards.genome}; "

rule annotate_peaks:
        input:
            "peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks.bed",
            os.path.join("some_path", config["general"]["genome"]"), # request to download if not present
        output:
            "peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks_annotated.txt"
        params:
            genome=config["general"]["genome"],
            regions=config["chip-seq"]["macs2"]["regions"],
        threads: config["resources"]["macs2"]["cpu"]
        resources: 
            runtime=config["resources"]["macs2"]["time"]
        conda:
            "envs/macs2.yaml"
        shell:
            "annotatePeaks.pl {input} {params.genome} > {output}"

If there is not a simple way to obtain the path for the output of download rule, maybe you can use a fake flag file to mark that it was downloaded. You can use a simple bash && command to touch a fake file or you can also use snakemake touch directive https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#flag-files

rule download_genome_and_produce_flag:
        output:
            output: touch("downloaded_{genome}"),
        conda:
            "envs/macs2.yaml"
        shell:
            "perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {wildcards.genome}"

rule annotate_peaks:
        input:
            "peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks.bed",
            "downloaded_" + config["general"]["genome"], # or refactor to a function
        ....