
How do I limit the total amount of memory end users can use across their Slurm jobs?


Given a specific partition, I want to be able to limit the total memory consumed by a user's running jobs.

I was able to define a QOS:

      Name     MaxTRESPU                Flags
---------- ------------- --------------------
    normal
    memlim   mem=750000M          DenyOnLimit

and was able to attach this QOS to the partition in question:

PartitionName=testpartition Nodes=node[01-03] MaxTime=INFINITE State=UP qos=memlim
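
(For reference, a per-user memory cap like this is normally created with sacctmgr before being referenced from slurm.conf; a rough sketch of the commands, assuming accounting is enabled and memlim is the QOS name used above:)

    # create the QOS and cap the memory a single user's running jobs may consume
    sacctmgr add qos memlim
    sacctmgr modify qos memlim set MaxTRESPerUser=mem=750000M Flags=DenyOnLimit

    # confirm the partition picked up the QOS after reconfiguring/restarting slurmctld
    scontrol show partition testpartition | grep -i qos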

This seems to work in limiting what an end user can submit; however, even a simple command like

srun -p testpartition hostname

will still give the following

# squeue -u testuser
             JOBID PARTITION     NAME       USER ST       TIME  NODES NODELIST(REASON)
          38855732      nrpe hostname   testuser PD       0:00      1 (QOSMaxMemoryPerUser)
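
(To see what memory request the scheduler actually recorded for the pending job, and therefore what is tripping the limit, the job can be inspected directly; a quick check, using the job ID from the squeue output above:)

    # MinMemoryNode / MinMemoryCPU show the memory the job inherited from the defaults
    scontrol show job 38855732 | grep -i memory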

So it seems that it is not tracking the memory of running jobs but something else. Perhaps the memory used by the user over all time?


Solution

  • I think the reason the job is held in the pending state is the default --mem value (DefMemPerNode and MaxMemPerNode). It is sometimes set to UNLIMITED (check scontrol show config | grep "Mem"). So when you submit a job without an explicit memory request, it inherits that default, which violates the partition QOS (a higher-priority limit), and the job is kept pending; see the workarounds below.

    Try running your srun command with the extra parameters:

     srun -p testpartition --mem 70000M -N 1 hostname
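
    If you would rather not require users to pass --mem themselves, a related option is to give the partition a finite default memory request in slurm.conf, so that jobs submitted without --mem no longer inherit an UNLIMITED value; a sketch, where DefMemPerNode=4096 (MB) is only an illustrative placeholder you would tune for your nodes:

     PartitionName=testpartition Nodes=node[01-03] MaxTime=INFINITE State=UP qos=memlim DefMemPerNode=4096

    followed by scontrol reconfigure (or a slurmctld restart) so the change takes effect.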