apache-spark, slurm, sparklyr

Obtaining number of nodes, number of cores and available RAM for tuning


I am trying to tune Spark on my HPC cluster (I use Sparklyr), and for that I want to collect some of the important specs mentioned in http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/:

To hopefully make all of this a little more concrete, here’s a worked example of configuring a Spark app to use as much of the cluster as possible: Imagine a cluster with six nodes running NodeManagers, each equipped with 16 cores and 64GB of memory.

namely:

  • number of nodes
  • number of cores
  • disk space and RAM

I know how to use sinfo -n -l, but its output lists so many cores that I cannot easily extract this information from it. Is there a simpler way to get the overall specs of my cluster?

Ultimately, I am trying to find some reasonable values for --num-executors, --executor-cores and --executor-memory.
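
In other words, I eventually want to be able to fill in something like this (the values here are just placeholders to show the shape of the command I am after):

    spark-submit --num-executors <num> --executor-cores <cores> --executor-memory <mem>G ...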


Solution

  • Number of nodes:

    sinfo -O "nodes" --noheader
    

  • Number of cores: Slurm's "cores" are, by default, the number of cores per socket, not the total number of cores available on the node. Somewhat confusingly, in Slurm, cpus = cores * sockets (thus, a machine with two 6-core processors has 2 sockets, 6 cores and 12 cpus).

    Number of cores (= cpus in Slurm), disk space and RAM are trickier to get, as they may differ between nodes. The following returns an easy-to-parse per-node list (see the sketch after this list for turning these numbers into Spark settings):

    sinfo -N -O "nodehost,disk,memory,cpus" --noheader
    

    If all nodes are the same, we can get the info from the first row of sinfo's output:

    Number of cores (=Slurm cpus) per node:

    sinfo -N -O "cpus" --noheader | head -1
    

    RAM per node:

    sinfo -N -O "memory" --noheader | head -1
    

    Disk space per node:

    sinfo -N -O "disk" --noheader | head -1