I'm trying to understand how much memory a simple job consumes (based on this Python script). I run the same script on a single worker with different numbers of cores allocated.
The results of this little test are:
$ sacct -j 875,876,877,878,879,880 --format=JobID,reqcpus,elapsed,MaxRSS,maxvmsize
JobID ReqCPUS Elapsed MaxRSS MaxVMSize
------------ -------- ---------- ---------- ----------
875 1 00:03:24
875.batch 1 00:03:24 33584K 254884K
876 2 00:01:52
876.batch 2 00:01:52 43560K 274124K
877 4 00:01:09
877.batch 4 00:01:09 66672K 311580K
878 8 00:00:38
878.batch 8 00:00:38 111636K 385468K
879 16 00:00:20
879.batch 16 00:00:20 1308K 79660K
880 32 00:00:11
880.batch 32 00:00:11 1488K 79792K
Most of this is what I expected: with more cores, the elapsed time gets shorter. What I don't quite understand is why MaxRSS and MaxVMSize increase with the number of cores but then suddenly drop at 16 cores.
Does anyone have an idea what is going on here?
The fact that memory usage increases with the number of CPUs is expected: the multiprocessing package relies on forking, which in a Python context duplicates memory in most situations because reference counting writes to every object it touches and so defeats copy-on-write, and because multiprocessing shares data between processes by pickling it and sending each worker its own copy by default.
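As a minimal sketch (not your actual script; the list size and core count below are made up for illustration), this is the pattern that produces a per-worker copy of the data and hence a MaxRSS that grows with the number of processes:

import multiprocessing as mp

def work(chunk):
    # Touching every element forces the copied pages to be resident in the child.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(2_000_000))           # a sizeable list held by the parent
    n = 8                                   # hypothetical core count
    chunks = [data[i::n] for i in range(n)] # one slice per worker
    # Pool forks n children; pool.map pickles each chunk and sends a copy to a child,
    # so resident memory roughly scales with the number of workers.
    with mp.Pool(processes=n) as pool:
        print(sum(pool.map(work, chunks)))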
The fact that it decreases for ReqCPUS > 8 is most probably because Slurm accounts memory usage at sampling intervals, 30 seconds by default (check with scontrol show config | grep JobAcctGatherFrequency). In your case, with ReqCPUS > 8, Elapsed is under 30 s, so the only memory measurement happens at the very beginning of the job and is not representative of the actual usage.
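If you want the samples to catch short jobs like these, you can ask for a higher accounting frequency per job with sbatch's --acctg-freq option, assuming your site allows overriding it; a sketch (job name, core count, and script name are placeholders):

#!/bin/bash
#SBATCH --job-name=memtest
#SBATCH --cpus-per-task=16
#SBATCH --acctg-freq=task=5   # sample task memory every 5 s instead of the 30 s default

python my_script.py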