snakemake

limiting the number (or size) of outstanding intermediate outputs to be processed


I have a workflow where an earlier, faster, step produces a large intermediate output that is then consumed by a slower subsequent step. As an example, imagine a workflow that decompresses gzip files and recompresses them with bzip2.

Here's an untested example to illustrate my problem:

rule decompress:
  input:
    "gz/{dataset}.dat.gz"
  output:
    temp("decompress/{dataset}.dat")
  shell:
    "gunzip -c {input} > {output}"

rule compress:
  input:
    "decompress/{dataset}.dat"
  output:
    "bzip/{dataset}.dat.bz2"
  shell:
    "bzip2 -c {input} > {output}"

My problem is that since the decompress step runs faster than the subsequent compress step, it tends to fill up my disk space with uncompressed files. I'm wondering: is there a way for me to limit the number (or size) of intermediate datasets that are waiting to be processed by the latter, slower rule?

Cheers.


Solution

  • It's possible to specify arbitrary resources, for example:

    rule decompress:
      input:
        "gz/{dataset}.dat.gz"
      output:
        temp("decompress/{dataset}.dat")
      resources:
        # an arbitrary resource name; how many jobs can
        # run at the same time is controlled by the
        # availability you specify on the command line
        limit_space=1,
      shell:
        "gunzip -c {input} > {output}"

    rule compress:
      input:
        "decompress/{dataset}.dat"
      output:
        "bzip/{dataset}.dat.bz2"
      resources:
        # compress and decompress now compete
        # for the same limited resource
        limit_space=1,
      priority:
        # a higher priority makes sure this rule runs
        # before the decompress rule whenever both are ready
        100,
      shell:
        "bzip2 -c {input} > {output}"
    

    To specify the resource availability, you can use the CLI:

    snakemake --resources limit_space=6
    

    There are more details in the docs on resources.
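
    Since you asked about size as well as number: resources can also be callables, which lets you budget by size instead of job count. Here's a hedged sketch, using Snakemake's documented input.size_mb attribute; the 4x expansion factor is a guess for gzipped data, not something from your workflow:

    rule decompress:
      input:
        "gz/{dataset}.dat.gz"
      output:
        temp("decompress/{dataset}.dat")
      resources:
        # reserve roughly the decompressed size in MB;
        # the 4x expansion factor is an assumption,
        # tune it for your data
        limit_space=lambda wildcards, input: 4 * max(1, int(input.size_mb)),
      shell:
        "gunzip -c {input} > {output}"

    snakemake --resources limit_space=6000

    would then cap concurrently running decompress jobs at roughly 6 GB of produced output (the compress rule would declare the same callable so the two keep competing for the budget). Note that a resource is only held while a job runs, so this bounds the data being written at any moment, not files already on disk waiting for compress.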

    Update: as pointed out in the comments, the command above will initially launch 6 instances of the decompress rule (assuming --cores allows it) and then, as they complete, launch compress instances one by one. At most 6 decompress jobs can hold the resource at once, and because compress has the higher priority, each freed-up resource slot goes to a waiting compress job first, which keeps the number of unprocessed intermediate files bounded.

    Also, a dedicated resource is useful when there are other rules competing for cores: the constraint specific to decompress/compress stays isolated from the rest of the workflow, as the sketch below shows.
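
    For example, a minimal sketch of an unrelated rule (the align rule, its files, and run_aligner are all hypothetical): it requests threads but not limit_space, so it competes for --cores without touching the decompress/compress budget:

    # hypothetical unrelated rule: it consumes cores but does
    # not request limit_space, so it is scheduled independently
    # of the decompress/compress constraint
    rule align:
      input:
        "reads/{sample}.fq"
      output:
        "aligned/{sample}.bam"
      threads: 4
      shell:
        "run_aligner --threads {threads} {input} > {output}"

    Running snakemake --cores 8 --resources limit_space=6 then schedules align jobs alongside the compression pipeline, with only the latter constrained by limit_space.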