I work with 8 paired-end fastq files of about 150 GB each, which need to be processed by a pipeline with space-demanding sub-tasks. I have tried several options, but I am still running out of disk space.
I use the following invocation to limit my disk usage to 500 GB, but apparently this is not guaranteed and the usage still exceeds 500 GB. How can I limit disk usage to a fixed value so that I do not run out of disk space?
snakemake --resources disk_mb=500000 --use-conda --cores 16 -p
My Snakefile looks like this:
rule merge:
    input:
        fw="{sample}_1.fq.gz",
        rv="{sample}_2.fq.gz",
    output:
        temp("{sample}.assembled.fastq")
    resources:
        disk_mb=100000
    threads: 16
    shell:
        """
        merger-tool -f {input.fw} -r {input.rv} -o {output}
        """

rule filter:
    input:
        "{sample}.assembled.fastq"
    output:
        temp("{sample}.assembled.filtered.fastq")
    resources:
        disk_mb=100000
    shell:
        """
        filter-tool {input} {output}
        """

rule mapping:
    input:
        "{sample}.assembled.filtered.fastq"
    output:
        "{sample}_mapping_table.txt"
    resources:
        disk_mb=100000
    shell:
        """
        mapping-tool {input} {output}
        """
Snakemake does not have functionality to constrain resource usage; it can only schedule jobs in a way that respects the declared resource constraints. With --resources disk_mb=500000 and disk_mb=100000 per job, at most five jobs run at the same time, but the temporary outputs of jobs that have already finished keep accumulating on disk until the downstream rules that consume them have run.

So right now Snakemake uses resources only to limit concurrent jobs, while your problem has a cumulative aspect to it. Taking a look at this answer, one way to resolve this is to introduce priority, so that downstream tasks have the highest priority.

In your particular file, it seems that adding priority to the mapping rule should be sufficient:
rule mapping:
    input:
        "{sample}.assembled.filtered.fastq"
    output:
        "{sample}_mapping_table.txt"
    resources:
        disk_mb=100_000
    priority: 100
    shell:
        """
        mapping-tool {input} {output}
        """
You might also want to be careful about how many merge jobs are launched at the start of the run (to avoid filling up the disk with the results of merge before the downstream rules get a chance to consume and delete them).
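One way to do that, sketched below, is to declare an extra artificial resource on the merge rule and cap it on the command line; Snakemake will then only run that many merge jobs at once, while the higher-priority filter and mapping jobs clear the temporary files. The resource name merge_slots here is just an illustrative choice, not anything built into Snakemake:

rule merge:
    input:
        fw="{sample}_1.fq.gz",
        rv="{sample}_2.fq.gz",
    output:
        temp("{sample}.assembled.fastq")
    resources:
        disk_mb=100000,
        # "merge_slots" is an arbitrary, user-defined resource name used only
        # for illustration; Snakemake only enforces it if a global limit is
        # passed via --resources on the command line.
        merge_slots=1
    threads: 16
    shell:
        """
        merger-tool -f {input.fw} -r {input.rv} -o {output}
        """

Invoked with something like snakemake --resources disk_mb=500000 merge_slots=2 --use-conda --cores 16 -p, at most two merge jobs would run concurrently, so at any point the disk should hold at most two assembled temporary files plus whatever the downstream rules are currently working on.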