linux resources slurm hpc

SLURM reported memory consumption


I'm trying to understand how much memory a simple job consumes (based on this Python script). I ran the same script on a single worker, allocating a different number of cores each time.

The results of this little test are:

$ sacct -j 875,876,877,878,879,880 --format=JobID,reqcpus,elapsed,MaxRSS,maxvmsize
JobID         ReqCPUS    Elapsed     MaxRSS  MaxVMSize
------------ -------- ---------- ---------- ----------
875                 1   00:03:24
875.batch           1   00:03:24     33584K    254884K
876                 2   00:01:52
876.batch           2   00:01:52     43560K    274124K
877                 4   00:01:09
877.batch           4   00:01:09     66672K    311580K
878                 8   00:00:38
878.batch           8   00:00:38    111636K    385468K
879                16   00:00:20
879.batch          16   00:00:20      1308K     79660K
880                32   00:00:11
880.batch          32   00:00:11      1488K     79792K

This is mostly fine: I would expect the elapsed time to shrink as I use more cores. What I don't quite understand is why MaxRSS and MaxVMSize increase with the number of cores but then suddenly drop.

Does anyone have an idea what is going on here?


Solution

  • Memory usage increasing with the number of CPUs is expected. The multiprocessing package relies on forking, which in a Python context duplicates memory in most situations, because reference counting writes to pages that would otherwise stay shared between parent and child. In addition, multiprocessing by default shares data by pickling it and sending copies to the worker processes.

    The decrease for ReqCPUS>8 is most probably because Slurm accounts memory usage at sampling intervals, 30 seconds by default (check with scontrol show config|grep JobAcctGatherFrequency). In your case, with ReqCPUS>8, Elapsed is <30s, so the only memory measurement was taken at the very beginning of the job and is not representative of the actual usage.
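
As a rough illustration of the sampling argument, the sketch below flags jobs whose Elapsed time is shorter than the accounting interval. It is only a sketch under assumptions: the 30-second interval is the default mentioned above and may differ on your cluster, the helper names (elapsed_to_seconds, undersampled) are made up for this example, and Elapsed is assumed to be printed as HH:MM:SS, optionally prefixed with DD-. The job list is copied from the sacct output in the question.

```python
# Assumed sampling interval in seconds; check JobAcctGatherFrequency
# on your own cluster and adjust this value accordingly.
SAMPLING_INTERVAL_S = 30

# (JobID, Elapsed) pairs copied from the sacct output in the question.
JOBS = [
    ("875", "00:03:24"),
    ("876", "00:01:52"),
    ("877", "00:01:09"),
    ("878", "00:00:38"),
    ("879", "00:00:20"),
    ("880", "00:00:11"),
]


def elapsed_to_seconds(elapsed):
    """Convert an Elapsed value (assumed [DD-]HH:MM:SS) to seconds."""
    days = 0
    if "-" in elapsed:
        day_part, elapsed = elapsed.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in elapsed.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def undersampled(jobs, interval_s=SAMPLING_INTERVAL_S):
    """Return the IDs of jobs that finished before one full interval passed."""
    return [
        job_id
        for job_id, elapsed in jobs
        if elapsed_to_seconds(elapsed) < interval_s
    ]


print(undersampled(JOBS))  # prints ['879', '880']
```

The two flagged jobs are exactly the ones whose MaxRSS and MaxVMSize look implausibly small, which is consistent with the explanation above. For such short jobs, the reported values should be treated as a lower bound rather than as the real peak.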