Given a specific partition, I want to be able to limit the users' memory on running jobs.
I was able to define a QOS
      Name     MaxTRESPU                Flags
---------- ------------- --------------------
    normal
    memlim   mem=750000M          DenyOnLimit
and was able to attach this QOS to the partition in question.
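For reference, a QOS like this can usually be created and populated with sacctmgr; a sketch (the name, limit, and flag are taken from the table above — the settable field is MaxTRESPerUser, which sacctmgr displays as MaxTRESPU):

sacctmgr -i add qos memlim
sacctmgr -i modify qos memlim set MaxTRESPerUser=mem=750000M Flags=DenyOnLimit

The -i flag answers "yes" to the confirmation prompt; this assumes Slurm accounting (slurmdbd) is already configured.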
PartitionName=testpartition Nodes=node[01-03] MaxTime=INFINITE State=UP qos=memlim
This seems to work in limiting what an end user can submit; however, even a simple command like
srun -p testpartition hostname
will still produce the following:
# squeue -u testuser
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
38855732 nrpe hostname testuser PD 0:00 1 (QOSMaxMemoryPerUser)
So it seems it is not tracking the memory of running jobs but something else. Perhaps the total memory used by the user over all time?
I think the reason the job is held in the pending state is the default --mem value (DefMemPerNode and MaxMemPerNode). These are sometimes set to UNLIMITED (check with scontrol show config | grep "Mem"). When a job does not request memory explicitly, Slurm charges it the default per-node memory, and if that default exceeds the QOS's MaxTRESPU limit, the job is kept pending with the reason QOSMaxMemoryPerUser.
Try running your srun command with an explicit memory request:
srun -p testpartition --mem 70000M -N 1 hostname
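Alternatively, if you don't want every user to have to pass --mem, you can set a finite default in slurm.conf so that plain jobs request something below the QOS cap (a sketch; the values are illustrative, units are megabytes, and DefMemPerNode cannot be combined with DefMemPerCPU):

# slurm.conf: give jobs a finite default memory request per node
DefMemPerNode=70000
# optionally also cap what a single job may request on a node
MaxMemPerNode=750000

After editing slurm.conf, the change has to be propagated to the nodes and picked up with scontrol reconfigure.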