Tags: nodes, cluster-computing, jobs, slurm

Jobs die on one of the nodes


I'm setting up a Slurm cluster. It has 2 nodes for this test, and I see:

PARTITION AVAIL   TIMELIMIT  NODES  STATE NODELIST
base*        up  7-00:00:00      2   idle node[01-02]

I also share a folder between the two nodes

node1

drwxrwxrwx  2 nobody  nogroup 4.0K Apr 26 18:27 shared

node2

drwxrwxrwx  2 nobody  nogroup 4.0K Apr 26 18:27 shared-node1

Inside the shared folder on node1 (the node from which I submit the jobs) I create a file echo.sh

-rwxrwxrwx 1 myname mygroup  137 Apr 26 18:11 echo.sh

and the code inside is

#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
echo "test start"
sleep 180
echo "test end"

When I submit the job using a loop like

for ((x=1;x<1000;x++)); do sbatch echo.sh; done

The jobs dispatched to node2 die, while those on node1 keep running. The dead jobs on node2 do not create a log file, so I cannot find the error.
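
To dig a bit deeper into jobs that die without producing a log file, Slurm itself can be queried (this assumes job accounting is enabled; the job ID below is just a placeholder):

# sacct reports the state and exit code even for jobs that failed
sacct -j 12345 --format=JobID,State,ExitCode,NodeList

# while the record is still held by the controller, scontrol shows the full job details
scontrol show job 12345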

What am I doing wrong? Thanks

Here are the last lines of the shared slurm.conf file:

# Node
NodeName=node1 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515619
NodeName=node2 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515619

# Partition
PartitionName=base Nodes=node1,node2 Default=Yes MaxTime=7-00:00:00 Priority=1 State=UP
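
To double-check that these NodeName lines match the real hardware, slurmd can be run in probe mode on each compute node; it prints the values it detects in slurm.conf syntax, which can be compared against the lines above:

# run on each compute node: print the detected CPUs, sockets, cores, threads and memory
slurmd -C

# from any node: show the controller's view of a node, including its State and Reason
scontrol show node node2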

Here are the slurmd logs from node2:

task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 38676
task/affinity: batch_bind: job 38676 CPU input mask for node: 0x0000000000000000000F
task/affinity: batch_bind: job 38676 CPU final HW mask for node: 0x00000000030000000003
task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 38677
task/affinity: batch_bind: job 38677 CPU input mask for node: 0x000000000000000000F0
task/affinity: batch_bind: job 38677 CPU final HW mask for node: 0x000000000C000000000C
Launching batch job 38676 for UID 1000
Launching batch job 38677 for UID 1000
[38677.batch] error: Could not open stdout file /home/myname/shared/slurm-38677.out: No such file or directory
[38677.batch] error: IO setup failed: No such file or directory
[38677.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[38677.batch] done with job
[38677.extern] done with job
[38676.batch] error: Could not open stdout file /home/myname/shared/slurm-38676.out: No such file or directory
[38676.batch] error: IO setup failed: No such file or directory
[38676.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[38676.batch] done with job
[38676.extern] done with job

This path (/home/myname/shared/) exists on node1 and is shared with node2, as shown above, yet the slurm-xxx.out files were not created ... I tried to look up the meaning of "status:256" but was not able to find a solution.
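
A quick way to confirm whether node2 can actually see that path is to pin a trivial command to each node with srun (-w selects the node; --chdir avoids depending on the submission directory existing remotely):

# if node2 reports "No such file or directory", the output path simply does not exist there
srun -w node2 --chdir=/tmp ls -ld /home/myname/shared
srun -w node1 --chdir=/tmp ls -ld /home/myname/shared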

Also, in my config file I had deactivated the cgroup task plugin because of a bug ... in case this helps:

#TaskPlugin=task/cgroup
TaskPlugin=task/affinity

Do I need to share the whole filesystem (/home) of node1 (the master) with node2?


Solution

  • Slurm expects the directory hierarchy to look the same from both nodes.

    So if you are exporting /home/myname/shared from node1 to node2, it has to be mounted as /home/myname/shared on node2; it cannot be named /home/myname/shared-node1/.

    You can rename the mount point on node2, or create a symbolic link on node2 (a mount-based alternative is sketched after the command below):

    ln -s /home/myname/shared-node1/ /home/myname/shared
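
    Alternatively, mount the export on node2 under the same path it has on node1. A sketch assuming the folder is exported from node1 over NFS (adjust to whatever export mechanism is actually in use):

    # on node2: create the matching mount point and mount the export there
    sudo mkdir -p /home/myname/shared
    sudo mount -t nfs node1:/home/myname/shared /home/myname/shared

    # to make it persistent, add an /etc/fstab entry along these lines:
    # node1:/home/myname/shared  /home/myname/shared  nfs  defaults  0  0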