I have a workflow where an earlier, faster step produces a large intermediate output that is then consumed by a slower subsequent step. As an example, imagine a workflow that decompresses gzip files and recompresses them with bzip2.
Here's an untested example to illustrate my problem:
rule decompress:
    input:
        "gz/{dataset}.dat.gz"
    output:
        temp("decompress/{dataset}.dat")
    shell:
        "gunzip -c {input} > {output}"
rule compress:
    input:
        "decompress/{dataset}.dat"
    output:
        "bzip/{dataset}.dat.bz2"
    shell:
        "bzip2 -c {input} > {output}"
My problem is that since the decompress step runs faster than the compress step, it tends to fill up my disk with uncompressed files. Is there a way to limit the number (or size) of intermediate files that are waiting to be processed by the later, slower rule?
Cheers.
It's possible to specify arbitrary resources, for example:
rule decompress:
    input:
        "gz/{dataset}.dat.gz"
    output:
        temp("decompress/{dataset}.dat")
    resources:
        # arbitrary resource name used to control how many of
        # these jobs can run at the same time (the limit itself
        # is set on the command line as resource availability)
        limit_space=1,
    shell:
        "gunzip -c {input} > {output}"
rule compress:
    input:
        "decompress/{dataset}.dat"
    output:
        "bzip/{dataset}.dat.bz2"
    resources:
        # note that compress and decompress now
        # compete for the same limited resource
        limit_space=1,
    priority:
        # a higher priority for this rule ensures that it
        # runs before the decompress rule whenever both
        # are ready to be scheduled
        100
    shell:
        "bzip2 -c {input} > {output}"
To specify the resource constraint, you can use the CLI:
snakemake --resources limit_space=6
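Note that this constraint applies on top of the usual core limit, so the two can be combined (the numbers here are just examples):
snakemake --cores 4 --resources limit_space=6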
There are more details on resources in the docs.
Update: as pointed out in the comments, the command above will initially launch 6 instances of the decompress rule and then, as they complete, launch new compress instances one by one. The maximum number of concurrently running decompress instances is 6; if more than one decompress instance has completed, each freed-up resource slot is taken by the higher-priority compress rule.
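If you want to limit by size rather than by count, resources can also be given as callables. Here is a minimal sketch, assuming the uncompressed file is roughly ten times the size of the gzip input (the 10x expansion factor is a guess you would tune for your data; input.size_mb is Snakemake's built-in input-size attribute, and resource values must be integers):
rule decompress:
    input:
        "gz/{dataset}.dat.gz"
    output:
        temp("decompress/{dataset}.dat")
    resources:
        # estimate the uncompressed footprint from the gzip size;
        # the 10x expansion factor is an assumption, tune it
        limit_space=lambda wildcards, input: int(10 * input.size_mb),
    shell:
        "gunzip -c {input} > {output}"
Then something like snakemake --resources limit_space=50000 would cap the in-flight uncompressed data at roughly 50 GB. One caveat: resources are only held while a job is running, so a finished decompress releases its slot even though its temp file stays on disk until compress removes it; the priority on compress mitigates this.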
Also, a dedicated resource can be useful when other rules are competing for cores: with a dedicated resource, the constraint specific to decompress/compress stays isolated from the core count.
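For instance, a hypothetical third rule (checksum here is just an illustration) that needs cores but claims no limit_space can keep running at full parallelism while decompress/compress are throttled:
rule checksum:
    input:
        "bzip/{dataset}.dat.bz2"
    output:
        "bzip/{dataset}.dat.bz2.md5"
    shell:
        "md5sum {input} > {output}"
With snakemake --cores 8 --resources limit_space=6, checksum jobs can fill any spare cores, while at most 6 decompress/compress jobs hold limit_space slots at any time.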