I am using snakemake
to run a workflow. One of my rules uses the following yaml
file to create the Conda
env:
name: macs2
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- macs2=2.2.9.1
- homer=4.11
For the homer
package I need to install a data base, which I have included in the first line of the shell command of this rule:
rule annotate_peaks:
input:
"peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks.bed"
output:
"peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks_annotated.txt"
params:
genome=config["general"]["genome"],
regions=config["chip-seq"]["macs2"]["regions"],
threads: config["resources"]["macs2"]["cpu"]
resources:
runtime=config["resources"]["macs2"]["time"]
conda:
"envs/macs2.yaml"
shell:
"perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {params.genome}; "
"annotatePeaks.pl {input} {params.genome} > {output}"
The problem with this approach is that the first shell command will download the data for the chosen data base which are very large files and it takes a while to complete.
For each iteration of the rule this information will be downloaded again, which will slow down the analysis.
Is there a way to run perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {params.genome}
when snakemake
creates the "macs2" Conda
env so that this only has to be done once?
I presume that downloading script produces some output. I would split your rule into two rules, one specifically for downloading and the other only for running the annotation.
The downloading rule will have declared only the output directive, which would be the input for the second rule.
rule download_genome:
output:
directory("some_path/{genome}"), # the result of downloading
conda:
"envs/macs2.yaml"
shell:
"perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {wildcards.genome}; "
rule annotate_peaks:
input:
"peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks.bed",
os.path.join("some_path", config["general"]["genome"]"), # request to download if not present
output:
"peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks_annotated.txt"
params:
genome=config["general"]["genome"],
regions=config["chip-seq"]["macs2"]["regions"],
threads: config["resources"]["macs2"]["cpu"]
resources:
runtime=config["resources"]["macs2"]["time"]
conda:
"envs/macs2.yaml"
shell:
"annotatePeaks.pl {input} {params.genome} > {output}"
If there is not a simple way to obtain the path for the output of download rule, maybe you can use a fake flag file to mark that it was downloaded. You can use a simple bash && command to touch a fake file or you can also use snakemake touch directive https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#flag-files
rule download_genome_and_produce_flag:
output:
output: touch("downloaded_{genome}"),
conda:
"envs/macs2.yaml"
shell:
"perl $CONDA_PREFIX/share/homer*/configureHomer.pl -install {wildcards.genome}"
rule annotate_peaks:
input:
"peaks/{sample_ip}_vs_{sample_input}/{sample_ip}_peaks.bed",
"downloaded_" + config["general"]["genome"], # or refactor to a function
....