I'm setting up a Slurm cluster; it has 2 nodes for the test, and I see:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
base* up 7-00:00:00 2 idle node[01-02]
I also share a folder between the two nodes:
node1
drwxrwxrwx 2 nobody nogroup 4.0K Apr 26 18:27 shared
node2
drwxrwxrwx 2 nobody nogroup 4.0K Apr 26 18:27 shared-node1
Inside the shared folder on node1 (which is the node I submit jobs from) I create a file echo.sh:
-rwxrwxrwx 1 myname mygroup 137 Apr 26 18:11 echo.sh
The code inside is:
#!/usr/bin/env bash
#SBATCH --job-name=test
#SBATCH --cpus-per-task=4
#SBATCH --mem=1G
echo "test start"
sleep 180
echo "test end"
When I submit the jobs using a loop like:
for ((x=1;x<1000;x++)); do sbatch echo.sh; done
the jobs dispatched to node2 die, while the ones on node1 keep running. The dead jobs from node2 do not create a log file, and I cannot find the error.
What am I doing wrong? Thanks
Here are the last lines of the shared slurm.conf file:
# Node
NodeName=node1 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515619
NodeName=node2 CPUs=80 Boards=1 SocketsPerBoard=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=515619
# Partition
PartitionName=base Nodes=node1,node2 Default=Yes MaxTime=7-00:00:00 Priority=1 State=UP
And here are the slurmd logs from node2:
task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 38676
task/affinity: batch_bind: job 38676 CPU input mask for node: 0x0000000000000000000F
task/affinity: batch_bind: job 38676 CPU final HW mask for node: 0x00000000030000000003
task/affinity: task_p_slurmd_batch_request: task_p_slurmd_batch_request: 38677
task/affinity: batch_bind: job 38677 CPU input mask for node: 0x000000000000000000F0
task/affinity: batch_bind: job 38677 CPU final HW mask for node: 0x000000000C000000000C
Launching batch job 38676 for UID 1000
Launching batch job 38677 for UID 1000
[38677.batch] error: Could not open stdout file /home/myname/shared/slurm-38677.out: No such file or directory
[38677.batch] error: IO setup failed: No such file or directory
[38677.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[38677.batch] done with job
[38677.extern] done with job
[38676.batch] error: Could not open stdout file /home/myname/shared/slurm-38676.out: No such file or directory
[38676.batch] error: IO setup failed: No such file or directory
[38676.batch] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status:256
[38676.batch] done with job
[38676.extern] done with job
This path (/home/myname/shared/) exists on node1 and is shared with node2, as reported above; of course the slurm-xxx.out files were not created. I tried to look up the "status:256" but was not able to find a solution.
One more note: in my config file I had deactivated cgroup because I hit a bug, in case this helps:
#TaskPlugin=task/cgroup
TaskPlugin=task/affinity
Do I need to share the whole filesystem (/home) of node1 (the master) with node2?
Slurm expects to see the same directory hierarchy from both nodes. So if you are exporting /home/myname/shared from node1 to node2, it has to be mounted as /home/myname/shared on node2. It cannot be named /home/myname/shared-node1/.
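A quick way to verify this is to check the path from each node; a minimal sketch, assuming the node names and path from your post:

srun -w node1 ls -ld /home/myname/shared
srun -w node2 ls -ld /home/myname/shared

Until the mount point matches, the second command should fail with "No such file or directory", which is exactly the error slurmd logs when it tries to open the stdout file.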
You can either rename the mount point on node2, or create a symbolic link on node2:
ln -s /home/myname/shared-node1/ /home/myname/shared
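If the folder is exported over NFS, you could instead mount it at the matching path. A sketch of an /etc/fstab entry for node2, assuming node1 exports /home/myname/shared (the export path and options are assumptions; adjust them to your setup):

# /etc/fstab on node2: mount node1's export at the same path the jobs expect
node1:/home/myname/shared  /home/myname/shared  nfs  defaults,_netdev  0  0

After remounting, the jobs scheduled on node2 should be able to create their slurm-xxx.out files in the same location node1 uses.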