
How to set resources adequate for piped output in a snakemake workflow using --slurm?


I got my sizable snakemake workflow to run on our slurm cluster by using --profile and --slurm together, but if I leave either option out of the command, I get errors and failed jobs.

I'm working toward eliminating the need for the --profile option and running with only --slurm. I'm new at this. I fixed the first issue I ran into, but now I'm dealing with a "Not enough resources" error for one job and am a bit stuck.

Here's the snakemake command I'm currently using:

snakemake --use-conda --notemp --printshellcmds --directory .tests/test_1 --verbose --slurm --jobs unlimited --latency-wait 300

This is the rule that's failing:

rule summed_bedgraph_to_bigwig:
    input:
        bdg=rules.sort_summed_bedgraph.output,
        # None of the reference wildcards are used in the output, so they must all be expanded
        lengths=expand(
            rules.create_chromosome_lengths_file.output,
            refdir=config["reference_directory"],
            reference=config["reference_name"],
        ),
    output:
        temp("results/bigwigs/all_summed.bw"),
    log:
        std="results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout",
        err="results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    shell:
        "bedGraphToBigWig {input.bdg:q} {input.lengths:q} {output:q} 1> {log.std:q} 2> {log.err:q}"

And this is the failure I see in the console:

Error in rule summed_bedgraph_to_bigwig:
    message: SLURM-job '1152867' failed, SLURM status is: 'FAILED'
    jobid: 64
    input: results/bigwigs/all_sorted.bedGraph, input/reference/mm10.chr19.60m-end.chr_lengths.tsv
    output: results/bigwigs/all_summed.bw
    log: results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout, results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr, .snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/1152867.log (check log file(s) for error details)
    conda-env: /Genomics/argo/users/rleach/ATACCompendium/.tests/test_1/.snakemake/conda/f9952137484b0d8eeee5d0959433abf7_
    shell:
        bedGraphToBigWig results/bigwigs/all_sorted.bedGraph input/reference/mm10.chr19.60m-end.chr_lengths.tsv results/bigwigs/all_summed.bw 1> results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout 2> results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

This is the report of the submission of that job in the console (including the sbatch command from the --verbose flag):

rule summed_bedgraph_to_bigwig:
    input: results/bigwigs/all_sorted.bedGraph, input/reference/mm10.chr19.60m-end.chr_lengths.tsv
    output: results/bigwigs/all_summed.bw
    log: results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout, results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr
    jobid: 64
    reason: Missing output files: results/bigwigs/all_summed.bw; Input files updated by another job: results/bigwigs/all_sorted.bedGraph
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=<TBD>

bedGraphToBigWig results/bigwigs/all_sorted.bedGraph input/reference/mm10.chr19.60m-end.chr_lengths.tsv results/bigwigs/all_summed.bw 1> results/bigwigs/logs/summed_bedgraph_to_bigwig.stdout 2> results/bigwigs/logs/summed_bedgraph_to_bigwig.stderr
No wall time information given. This might or might not work on your cluster. If not, specify the resource runtime in your rule or as a reasonable default via --default-resources.
sbatch call: sbatch --job-name a93cfa76-5e46-4389-bf49-7860ef4442df -o .snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/%j.log --export=ALL -A main -p main --mem 1000 --cpus-per-task=1 -D /Genomics/argo/users/rleach/ATACCompendium/.tests/test_1 --wrap='/Genomics/argo/users/rleach/local/miniconda3/envs/ATACCD/bin/python3.11 -m snakemake --snakefile '"'"'/Genomics/argo/users/rleach/ATACCompendium/workflow/Snakefile'"'"' --target-jobs '"'"'summed_bedgraph_to_bigwig:'"'"' --allowed-rules '"'"'summed_bedgraph_to_bigwig'"'"' --cores '"'"'all'"'"' --attempt 1 --force-use-threads  --resources '"'"'mem_mb=1000'"'"' '"'"'mem_mib=954'"'"' '"'"'disk_mb=1000'"'"' '"'"'disk_mib=954'"'"' --wait-for-files '"'"'/Genomics/argo/users/rleach/ATACCompendium/.tests/test_1/.snakemake/tmp.x4tqnm8d'"'"' '"'"'results/bigwigs/all_sorted.bedGraph'"'"' '"'"'input/reference/mm10.chr19.60m-end.chr_lengths.tsv'"'"' '"'"'/Genomics/argo/users/rleach/ATACCompendium/.tests/test_1/.snakemake/conda/f9952137484b0d8eeee5d0959433abf7_'"'"' --force --keep-target-files --keep-remote --max-inventory-time 0 --nocolor --notemp --no-hooks --nolock --ignore-incomplete --rerun-triggers '"'"'params'"'"' '"'"'software-env'"'"' '"'"'code'"'"' '"'"'mtime'"'"' '"'"'input'"'"' --skip-script-cleanup  --use-conda  --conda-frontend '"'"'mamba'"'"' --conda-base-path '"'"'/Genomics/argo/users/rleach/local/miniconda3/envs/ATACCD'"'"' --wrapper-prefix '"'"'https://github.com/snakemake/snakemake-wrappers/raw/'"'"' --printshellcmds  --latency-wait 300 --scheduler '"'"'ilp'"'"' --scheduler-solver-path '"'"'/Genomics/argo/users/rleach/local/miniconda3/envs/ATACCD/bin'"'"' --default-resources '"'"'mem_mb=max(2*input.size_mb, 1000)'"'"' '"'"'disk_mb=max(2*input.size_mb, 1000)'"'"' '"'"'tmpdir=system_tmpdir'"'"' --directory '"'"'/Genomics/argo/users/rleach/ATACCompendium/.tests/test_1'"'"'  --slurm-jobstep --jobs 1 --mode 2'
Job 64 has been submitted with SLURM jobid 1152867 (log: .snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/1152867.log).

The stdout and stderr files are empty, but here's the slurm log:

$ cat .tests/test_1/.snakemake/slurm_logs/rule_summed_bedgraph_to_bigwig/1152867.log
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954
Select jobs to execute...
WorkflowError:
Error grouping resources in group '508bacf5-7b11-4068-96f3-6f6d733f5e32': Not enough resources were provided. This error is typically caused by a Pipe group requiring too many resources. Note that resources are summed across every member of the pipe group, except for ['runtime'], which is calculated via max(). Excess Resources:
    mem_mib: 1908/954
    mem_mb: 2000/1000
    _cores: 2/1

I'm confused by the tips in the workflow error. How can "Not enough resources" be caused by a group "requiring too many resources"? I can't make sense of that. The rule that produces the output that this rule consumes is indeed part of a group that is created automatically by a pipe defined in the rule above it. For context, these are the two rules just before the one with the error:

rule bigwigs_to_summed_bedgraph:
    input:
        expand(
            "results/bigwigs/{dataset}_peak_coverage.bw",
            dataset=DATASETS,
        ),
    output:
        pipe("results/bigwigs/all_summed.bedGraph"),
    params:
        nargs=len(DATASETS),
    log:
        std="results/bigwigs/logs/bigwigs_to_summed_bedgraph.stdout",
        err="results/bigwigs/logs/bigwigs_to_summed_bedgraph.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    shell:
        # bigwigmerge requires a minimum of 2 files, but the original R script
        # supports 1, so to support 1 file here, we use bigwigtobedgraph
        """
        if [ {params.nargs:q} -eq 1 ]; then \
            bigWigToBedGraph {input:q} {output:q} 1> {log.std:q} 2> {log.err:q}; \
        else \
            bigWigMerge {input:q} {output:q} 1> {log.std:q} 2> {log.err:q}; \
        fi
        """


rule sort_summed_bedgraph:
    input:
        rules.bigwigs_to_summed_bedgraph.output,
    output:
        temp("results/bigwigs/all_sorted.bedGraph"),
    params:
        buffsize=lambda w, resources: resources.mem_mb - 1000,
    log:
        "results/bigwigs/logs/sort_summed_bedgraph.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    resources:
        mem_mb=48000,
    shell:
        "sort -k1,1 -k2,2n --buffer-size={params.buffsize}M {input:q} 2> {log:q} 1> {output:q}"

I have tried setting --cores 12 and --local-cores 12, but I still get the same error.

What do I have to set on the command line to get past this error?

UPDATE: I have narrowed the problem down to a recent, seemingly insignificant change I had made to the rule (sort_summed_bedgraph) immediately before the rule with the error (summed_bedgraph_to_bigwig). I had added a param to calculate the value I wanted to supply to --buffer-size. When I reverted that change, the rules worked on our slurm cluster. This version works:

rule sort_summed_bedgraph:
    input:
        rules.bigwigs_to_summed_bedgraph.output,
    output:
        temp("results/bigwigs/all_sorted.bedGraph"),
    # params:
    #     buffsize=lambda w, resources: resources.mem_mb - 1000,
    log:
        "results/bigwigs/logs/sort_summed_bedgraph.stderr",
    conda:
        "../envs/bigwig_tools.yml"
    resources:
        mem_mb=48000,
    shell:
        # "sort -k1,1 -k2,2n --buffer-size={params.buffsize}M {input:q} 2> {log:q} 1> {output:q}"
        "sort -k1,1 -k2,2n --buffer-size=47000M {input:q} 2> {log:q} 1> {output:q}"

But why does adding a param whose lambda computes its value from the resources cause the error? Is this a bug, or does it somehow make sense?


Solution

  • I believe that this is indeed a bug in snakemake. Setting aside whatever other conditions may be necessary (e.g. the pipe group, temp files, etc.), the one thing that breaks the rule following the pipe group is accessing resources.mem_mb in the params lambda of the pipe group's second rule.

    Somehow, this causes the following individual rule to be designated as a group (with a different group ID from the preceding pipe group) with double the resource constraints. However, the job is submitted to the cluster as an individual job with an individual job's default resources.

    The resulting error from the snakemake code on the remote node says that the job needs twice the resources that the submission gave it, so it fails the job before it even starts.

    Since the point of using the lambda in the params was to specify only one memory value and compute the complementary value from it, the bug can be worked around, until it is fixed, by not accessing resources.mem_mb in the params section. I did this with a global variable and a function, and I chose to add rather than subtract so that a user couldn't edit in an invalid value (i.e. one <= 1000):

    SORT_BUFF_SIZE = 47000
    
    def get_sort_buffer_size(add=0):
        return SORT_BUFF_SIZE + add
    
    
    rule sort_summed_bedgraph:
        input:
            rules.bigwigs_to_summed_bedgraph.output,
        output:
            temp("results/bigwigs/all_sorted.bedGraph"),
        params:
            buffsize=get_sort_buffer_size(),
        log:
            "results/bigwigs/logs/sort_summed_bedgraph.stderr",
        conda:
            "../envs/bigwig_tools.yml"
        resources:
            mem_mb=get_sort_buffer_size(1000),
        shell:
            "sort -k1,1 -k2,2n --buffer-size={params.buffsize}M {input:q} 2> {log:q} 1> {output:q}"
    

    The other way to avoid the error, of course, is simply to hard-code the two memory values separately (which is why my code worked before I added the params section to calculate the second memory value).
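
To see where the 2000/1000 figures in the error message come from, here is a minimal Python sketch of the aggregation rule that the message itself describes: resources are summed across the members of a group, except runtime, which takes the maximum. This is not snakemake's actual code. The helper names combine_group_resources and excess_resources are hypothetical, and treating the failing job as a group with two members' worth of default resources is an assumption based on the doubled numbers in the log.

```python
# Illustrative only: NOT snakemake's implementation. Helper names are hypothetical.

def combine_group_resources(jobs, max_keys=("runtime",)):
    """Sum each resource across group members; resources in max_keys take the max."""
    total = {}
    for job in jobs:
        for name, value in job.items():
            if name in max_keys:
                total[name] = max(total.get(name, 0), value)
            else:
                total[name] = total.get(name, 0) + value
    return total


def excess_resources(required, provided):
    """Return {name: (required, provided)} for every resource that exceeds its limit."""
    return {
        name: (need, provided[name])
        for name, need in required.items()
        if name in provided and need > provided[name]
    }


# Per-job defaults and provided limits, taken from the logs in the question.
default_job = {"mem_mb": 1000, "mem_mib": 954, "_cores": 1}
provided = {"mem_mb": 1000, "mem_mib": 954, "_cores": 1}

# Assumption: the single job is evaluated as a group of two default-sized members.
required = combine_group_resources([default_job, default_job])

for name, (need, have) in excess_resources(required, provided).items():
    print(f"{name}: {need}/{have}")
```

Under that assumption, the summed requirement is exactly double what the single-job sbatch submission provided, which matches the three "Excess Resources" entries in the slurm log.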