Slurm control daemon throws "unrecognized key: OverSubscribe" on startup. What's going on?


I'm trying to set up Slurm on my university's server so that we can take turns running experiments, so that people with long-running experiments don't block people with shorter ones until they finish.

Since the server is a single node, I'm trying to set up simple gang scheduling, where jobs take turns running. However, slurmctld fails on startup, and systemctl status slurmctld shows the following:

× slurmctld.service - Slurm controller daemon
     Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Wed 2024-01-17 12:11:57 -03; 24s ago
       Docs: man:slurmctld(8)
    Process: 1288791 ExecStart=/usr/sbin/slurmctld -D -s $SLURMCTLD_OPTIONS (code=exited, status=1/FAILURE)
   Main PID: 1288791 (code=exited, status=1/FAILURE)
        CPU: 22ms

Jan 17 12:11:57 lab04 systemd[1]: Started Slurm controller daemon.
Jan 17 12:11:57 lab04 slurmctld[1288791]: slurmctld: error: _parse_next_key: Parsing error at unrecognized key: OverSubscribe
Jan 17 12:11:57 lab04 slurmctld[1288791]: error: _parse_next_key: Parsing error at unrecognized key: OverSubscribe
Jan 17 12:11:57 lab04 slurmctld[1288791]: fatal: Unable to process configuration file
Jan 17 12:11:57 lab04 slurmctld[1288791]: slurmctld: fatal: Unable to process configuration file
Jan 17 12:11:57 lab04 systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Jan 17 12:11:57 lab04 systemd[1]: slurmctld.service: Failed with result 'exit-code'.

Here's my slurm.conf:

ClusterName=localcluster
SlurmctldHost=localhost
ProctrackType=proctrack/linuxproc
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm/slurmctld
TaskPlugin=task/none
# TIMERS
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
# SCHEDULING
DefMemPerNode=245760
MaxMemPerNode=245760
#Set so that jobs alternate every half hour
SchedulerTimeSlice=15
SchedulerType=sched/builtin
SelectType=select/linear
SelectTypeParameters=CR_Memory
PreemptMode=GANG
#Controls how many jobs can run simultaneously, assuming they don't step on each other's memory
OverSubscribe=FORCE:6
# LOGGING AND ACCOUNTING
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
# COMPUTE NODES
NodeName=localhost CPUs=2 State=UNKNOWN
PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP

Solution

  • OverSubscribe is a property of a partition, not a global option, so it must appear on the line that defines the partition rather than on its own line in the body of the configuration file:

    PartitionName=LocalQ Nodes=ALL Default=YES MaxTime=INFINITE State=UP OverSubscribe=FORCE:6
                                                                         ^^^^^^^^^^^^^^^^^^^^^
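
If you want to catch this kind of mistake before restarting the daemon, a small script can scan the file for partition-level keys that sit outside a PartitionName line. The following is only a rough sketch, not a Slurm tool: the function name misplaced_partition_keys and the PARTITION_ONLY_KEYS set are made up for illustration, the set lists only the key from this question, and the parsing ignores backslash line continuations and included files.

```python
# Hypothetical helper: flags partition-only keys that appear on a line
# that is not a partition definition. The key set is an assumption and
# deliberately lists only the key discussed in this question.
PARTITION_ONLY_KEYS = {"oversubscribe"}


def misplaced_partition_keys(conf_text):
    """Return (line_number, key) pairs for partition-only keys found
    on lines that do not start with PartitionName."""
    problems = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        tokens = line.split()
        # The first key on the line decides what kind of line this is.
        first_key = tokens[0].split("=", 1)[0].lower()
        if first_key == "partitionname":
            continue  # the key is allowed on a partition line
        for token in tokens:
            key = token.split("=", 1)[0]
            if key.lower() in PARTITION_ONLY_KEYS:
                problems.append((lineno, key))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_conf.py /path/to/slurm.conf
    with open(sys.argv[1]) as handle:
        for lineno, key in misplaced_partition_keys(handle.read()):
            print(f"line {lineno}: {key} belongs on a PartitionName line")
```

Run against the slurm.conf from the question, this should flag the standalone OverSubscribe=FORCE:6 line; once the key is moved onto the PartitionName line, it should report nothing.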