I am trying to tune my Spark jobs on an HPC cluster (I use Sparklyr), and I want to collect some important specs mentioned in http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/:
To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory.
namely: the number of nodes, the cores per node, and the memory per node.
I know how to use sinfo -n -l
but its output lists every core individually, so I cannot easily extract this information. Is there a simpler way to get the overall specs of my cluster?
Ultimately, I am trying to find some reasonable parameters for
--num-executors
--executor-cores
and --executor-memory
Number of nodes:
sinfo -O "nodes" --noheader
Number of cores: Slurm's "cores" are, by default, the number of cores per socket, not the total number of cores available on the node. Somewhat confusingly, in Slurm, cpus = cores * sockets (thus, a machine with two 6-core processors has 2 sockets, 6 cores per socket, and 12 cpus).
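To make the sockets/cores/cpus relationship concrete, here is a minimal sketch with hypothetical values for a two-socket, 6-cores-per-socket node (on a real cluster, you would read these fields from sinfo -N -O "sockets,cores,cpus"):

```shell
# Hypothetical node: 2 sockets, 6 cores per socket
sockets=2
cores_per_socket=6

# Slurm's "cpus" per node = sockets * cores (per socket)
cpus=$(( sockets * cores_per_socket ))
echo "$cpus"   # 12
```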
Number of cores (=cpus in Slurm), disk space and RAM are trickier to get, as they might differ between nodes. The following returns an easy-to-parse list:
sinfo -N -O "nodehost,disk,memory,cpus" --noheader
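On a heterogeneous cluster you would usually size executors against the smallest node. A sketch of parsing that list with awk, using made-up sinfo output (node names, disk and memory values are hypothetical; real output comes from the command above):

```shell
# Hypothetical output of: sinfo -N -O "nodehost,disk,memory,cpus" --noheader
# Columns: nodehost, disk (MB), memory (MB), cpus
sinfo_output="node01 102400 64000 16
node02 102400 64000 16
node03  51200 32000  8"

# Take the minimum memory and cpus across nodes -- conservative
# values to size executors that fit on every node
min_mem=$(echo "$sinfo_output" | awk 'NR==1 || $3 < m {m=$3} END {print m}')
min_cpus=$(echo "$sinfo_output" | awk 'NR==1 || $4 < c {c=$4} END {print c}')

echo "min memory: ${min_mem} MB, min cpus: ${min_cpus}"
```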
If all nodes are the same, we can get the info from the first row of sinfo:
Number of cores (=Slurm cpus) per node:
sinfo -N -O "cpus" --noheader | head -1
RAM per node:
sinfo -N -O "memory" --noheader | head -1
disk space per node:
sinfo -N -O "disk" --noheader | head -1
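Once you have these numbers, the Cloudera post's worked example (6 nodes, 16 cores and 64 GB RAM each) can be turned into a small calculation. The reservations below (1 core and 1 GB per node for OS/Hadoop daemons, ~5 cores per executor, one executor slot for the application master, ~7% for spark.yarn.executor.memoryOverhead) are the heuristics from that post, not hard rules:

```shell
# Cluster specs (from sinfo, or hardcoded here per the Cloudera example)
nodes=6
cores_per_node=16
mem_per_node_gb=64

# Leave 1 core and 1 GB per node for the OS and Hadoop daemons
usable_cores=$(( cores_per_node - 1 ))      # 15
usable_mem=$(( mem_per_node_gb - 1 ))       # 63

# ~5 cores per executor gives good HDFS throughput
executor_cores=5
executors_per_node=$(( usable_cores / executor_cores ))   # 3

# One executor slot is left for the YARN application master
num_executors=$(( executors_per_node * nodes - 1 ))       # 17

# Split the usable memory, then subtract ~7% for memory overhead
mem_per_executor=$(( usable_mem / executors_per_node ))   # 21 GB
executor_memory=$(( mem_per_executor * 93 / 100 ))        # 19 GB

echo "--num-executors $num_executors --executor-cores $executor_cores --executor-memory ${executor_memory}G"
```

This prints --num-executors 17 --executor-cores 5 --executor-memory 19G, matching the configuration derived in the blog post.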