The Snakefile consists of two rules: one downloads a genome, and the other uses bowtie2-build to build a bowtie2 index from the resulting .fna file. The code is below:
rule reference_genome_download:
    output:
        "reference_genome/ref_genome.fna"
    shell:
        """
        gca="GCA_000372685.2"
        datasets download genome accession $gca --exclude-gff3 --exclude-protein --exclude-rna
        unzip ncbi_dataset.zip
        cat $(ls ncbi_dataset/data/$gca/chr*) > {output}
        rm -r ncbi_dataset
        rm README.md
        rm ncbi_dataset.zip
        """

rule build_bowtie_index:
    input:
        "reference_genome/ref_genome.fna"
    output:
        "reference_genome/btbuild.log"
    shell:
        "bowtie2-build {input} reference_genome/ref_genome_btindex > {output}"
When I dry run it with snakemake -n -c 10, I get the following:
Building DAG of jobs...
Job stats:
job                          count    min threads    max threads
-------------------------  -------  -------------  -------------
reference_genome_download        1              1              1
total                            1              1              1

[Fri Jan 28 12:54:25 2022]
rule reference_genome_download:
    output: reference_genome/ref_genome.fna
    jobid: 0
    resources: tmpdir=/tmp
The rule build_bowtie_index doesn't even appear as a job option. How do I get the two to link?
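With Snakemake, I find it more useful to think in terms of what I want at the end rather than in terms of a sequence of jobs. When no target is given on the command line, Snakemake takes the first rule in the Snakefile as the target and uses the other rules only as needed to produce that rule's inputs. In your Snakefile the first rule is reference_genome_download, which has no inputs, so it is the only job that gets scheduled.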
In your case, the first rule should be something like:
rule all:
    input:
        "reference_genome/btbuild.log",
meaning that at the end you want reference_genome/btbuild.log. Snakemake will figure out that to produce reference_genome/btbuild.log it needs to run reference_genome_download first and then build_bowtie_index. In fact, the order of the rules after the first doesn't even matter; Snakemake will combine them in the right order by itself.
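To make the "work backwards from the target" idea concrete, here is a toy model in plain Python. It is only a sketch of the concept and not how Snakemake is actually implemented: the RULES dictionary and the plan function are hypothetical names made up for this illustration, and the sketch assumes the rules contain no cycles. The rules are deliberately listed out of order to show that only the chosen target matters.

```python
# Toy model of target-driven resolution; NOT Snakemake's real implementation.
# Each hypothetical entry maps a rule name to its input and output files.
RULES = {
    "build_bowtie_index": {
        "input": ["reference_genome/ref_genome.fna"],
        "output": ["reference_genome/btbuild.log"],
    },
    "reference_genome_download": {
        "input": [],
        "output": ["reference_genome/ref_genome.fna"],
    },
    "all": {
        "input": ["reference_genome/btbuild.log"],
        "output": [],
    },
}


def plan(target_rule, rules):
    """Return the rule names, in execution order, needed to run target_rule."""
    # Map each output file to the rule that produces it.
    producers = {out: name for name, r in rules.items() for out in r["output"]}
    order = []

    def visit(name):
        if name in order:
            return
        # Schedule the producer of every input before the rule itself.
        for path in rules[name]["input"]:
            if path in producers:
                visit(producers[path])
        order.append(name)

    visit(target_rule)
    return order


print(plan("all", RULES))
print(plan("reference_genome_download", RULES))
```

With "all" as the target, the plan contains the download, then the index build, then "all". With reference_genome_download as the target, which mirrors the original Snakefile where it is the first rule, the plan contains only the download.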