Tags: cluster-computing, snakemake, hpc, pbs

Snakemake remote rule stalling before executing script in PBS cluster


I have a Snakemake (7.22.0) workflow whose jobs stall after they start. I have rules that run on a cluster (through PBS) and execute an external Python script. I noticed that some of these rules now stall for a very long time before executing the script: the job starts, and Snakemake reports that the rule has started running, but the actual script only begins about two hours later. The output I get from the job looks something like this:

[Tue Oct 15 23:13:13 2024]
rule ...:
    input: ...
    output: ...
    jobid: 0
    reason: Missing output files: ...
    wildcards: ...
    resources: mem_mb=1000, mem_mib=954, disk_mb=1000, disk_mib=954, tmpdir=/var/tmp/pbs.<job id>.<cluster name>

2024-10-16 01:21:37.393620 log from first line of the script
...
2024-10-16 01:21:41.212192 log from last line of the script (after reading large files) 
Not cleaning up <tmp script path>
[Wed Oct 16 01:21:41 2024]
Finished job 0.
1 of 1 steps (100%) done

Has anyone experienced something like this? What might Snakemake be doing that could cause it? I'm generating lots of files in the workflow (only one in this job), so that is a suspect, but I don't entirely see how it would lead to this behaviour. Also, the top-level "all" rule triggers many other rules (thousands, though I cap the number of jobs submitted to PBS at any one time), and evaluating that target takes ~20 minutes; however, that is not the rule executing here, and other instances of the same rule run normally.
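For context, the affected rules look roughly like the sketch below. This is a simplified illustration, not the real workflow; all names and paths are placeholders. The external script is run through the script: directive:

# Simplified sketch of one of the affected rules (placeholder names and paths).
rule process_sample:
    input:
        "data/{sample}.raw"
    output:
        "results/{sample}.processed"
    resources:
        mem_mb=1000
    script:
        # Snakemake stages a copy of this script before running it on the node;
        # the "Not cleaning up <tmp script path>" line in the log refers to that copy.
        "scripts/process_sample.py"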

These are the statistics from PBS at some point during the job's execution, taken before the external script had started:

Job Id: ...
    Job_Name = snakejob....
    Job_Owner = ...
    resources_used.cpupercent = 4
    resources_used.cput = 00:00:44
    resources_used.mem = 231660kb
    resources_used.ncpus = 1
    resources_used.vmem = 977976kb
    resources_used.walltime = 00:54:14

The memory consumption seems excessive to me, but I'm not sure. Is there something Snakemake does on startup that could use this much memory (under extreme conditions, whatever those may be)?


Solution

  • The problem turned out to be that the directory workdir/.snakemake/scripts had accumulated a huge number of files (~600,000) left over from previous runs of the workflow. Deleting the old scripts there resolved the stalling (see the cleanup sketch below).
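If it helps anyone else, below is a minimal cleanup sketch in Python. It assumes the working directory is called workdir; that path and the age threshold are placeholders to adjust for your own setup.

import time
from pathlib import Path

# Placeholder path: the Snakemake working directory of the workflow.
scripts_dir = Path("workdir") / ".snakemake" / "scripts"

# Remove staged script copies older than one week (adjust as needed).
cutoff = time.time() - 7 * 24 * 3600
removed = 0
for path in scripts_dir.iterdir():
    if path.is_file() and path.stat().st_mtime < cutoff:
        path.unlink()
        removed += 1
print(f"Removed {removed} old staged scripts from {scripts_dir}")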