Understanding Parameters for Intel MKL LINPACK w/MPI `ppn` and `np`

Update 3

  • Changed NUMA_PER_MPI to 4
  • Changed P and Q to 1 and 2.

While the above config does launch and run, it is significantly outperformed by setting

  • P&Q to 2&4

which we do not understand.

Update 2

I discovered from this post that Intel's MP_LINPACK does not use OpenMP, which explains why the OpenMP commands do not work.


After more study, I now understand that Intel is in fact running OpenMP under the hood. However, there are still several inconsistencies I don't understand.

I have updated everything to use the config below. As Intel's comments indicate, I have set MPI_PER_NODE to two since each of my systems has two sockets, and then I adjusted MPI_PROC_NUM to match.

The config below blows up with this error. Each host runs 116 threads (I am not sure why, but there always seem to be four extra threads).

In response to that I have tried:

  • Limiting the threads with OMP_NUM_THREADS, which appears to be completely ignored for reasons that aren't clear to me.
  • Limiting the threads with HPL_NUMTHREADS, which does limit the threads but doesn't fix the problem below.
  • Running with the exports below, but as far as I can tell that did nothing and the same error persisted.

Updated Run File

# Set total number of MPI processes for the HPL (should be equal to PxQ).
export MPI_PROC_NUM=4

# Set the MPI per node for each node.
# MPI_PER_NODE should be equal to 1 or number of sockets on the system.
# It will be same as -perhost or -ppn paramaters in mpirun/mpiexec.
export MPI_PER_NODE=2

# Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI)
# should be equal to number of NUMA nodes on the system.
export NUMA_PER_MPI=8

# Following option is for Intel(R) Optimized HPL-AI Benchmark

# Comment in to enable Intel(R) Optimized HPL-AI Benchmark
# export USE_HPL_AI=1

# Following option is for Intel(R) Optimized HPL-AI Benchmark for GPU

# By default, Intel(R) Optimized HPL-AI Benchmark for GPU will use
# Bfloat16 matrix. If you prefer less iterations, you could choose
# float based matrix. But it will reduce maximum problem size. 
# export USE_BF16MAT=0

# Following options are for Intel(R) Distribution for LINPACK
# Benchmark for GPU and Intel(R) Optimized HPL-AI Benchmark for GPU

# Comment in to enable GPUs
# export USE_HPL_GPU=1

# Select backend driver for GPU (OpenCL ... 0, Level Zero ... 1)
# export HPL_DRIVER=0

# Number of stacks on each GPU
# export HPL_NUMSTACK=2

# Total number of GPUs on each node
# export HPL_NUMDEV=2


export OUT=xhpl_intel64_dynamic_outputs.txt

# Pick the binary: plain HPL or HPL-AI, each with a CPU and a GPU variant.
if [ -z ${USE_HPL_AI} ]; then
    if [ -z ${USE_HPL_GPU} ]; then
        export HPL_EXE=xhpl_intel64_dynamic
    else
        export HPL_EXE=xhpl_intel64_dynamic_gpu
    fi
else
    if [ -z ${USE_HPL_GPU} ]; then
        export HPL_EXE=xhpl-ai_intel64_dynamic
    else
        export HPL_EXE=xhpl-ai_intel64_dynamic_gpu
    fi
fi

echo -n "This run was done on: "
date

# Capture some meaningful data for future reference:
echo -n "This run was done on: " >> $OUT
date >> $OUT
echo "HPL.dat: " >> $OUT
cat HPL.dat >> $OUT
echo "Binary name: " >> $OUT
ls -l ${HPL_EXE} >> $OUT
echo "This script: " >> $OUT
cat runme_intel64_dynamic >> $OUT
echo "Environment variables: " >> $OUT
env >> $OUT
echo "Actual run: " >> $OUT

# Environment variables can also be set on the Intel(R) MPI Library command
# line using the -genv option (to appear before the -np 1):

mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT

echo -n "Done: " >> $OUT
date >> $OUT

echo -n "Done: "
date

Updated HPL.dat

This is based on:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
249984         Ns (this is an example; adjust based on the calculation above)
1            # of NBs
384          NBs (a common choice, but you might experiment with this)
1            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps (you could also try 8 for a different P x Q configuration)
2            Qs (correspondingly, this would be 4 if you chose P as 8)
1.0          threshold
1            # of panel fact
2 1 0        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1 0 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
0            SWAP (0=bin-exch,1=long,2=mix)
1            swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
1            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
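
The Ns value above depends on how much memory the cluster has. A minimal sketch of the usual HPL rule of thumb follows, assuming the matrix of 8-byte doubles should fill a chosen fraction of total memory and that N is rounded down to a multiple of NB. The helper name `hpl_problem_size` and the 512 GiB and 0.8 inputs are illustrative placeholders, not values measured on this cluster.

```shell
#!/usr/bin/env bash
# Rule-of-thumb HPL problem size (an assumption, not taken from Intel's docs):
#   N ~= sqrt(fraction * total_bytes / 8), rounded down to a multiple of NB.
hpl_problem_size() {  # args: total_memory_GiB fraction NB
  awk -v gib="$1" -v f="$2" -v nb="$3" 'BEGIN {
    bytes = gib * 1024 * 1024 * 1024      # total memory in bytes
    n = sqrt(f * bytes / 8)               # 8 bytes per double-precision entry
    printf "%d\n", int(n / nb) * nb       # round down to a multiple of NB
  }'
}

# Example: 512 GiB across all nodes, 80% of it for the matrix, NB=384.
hpl_problem_size 512 0.8 384
```

Leaving some memory free (the fraction below 1.0) is meant to keep the operating system and MPI buffers from pushing the run into swap.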

Updated SLURM Job File

#SBATCH --job-name=linpack
#SBATCH --output=linpack_%j.out
#SBATCH --partition=8480             # Specify the partition
#SBATCH --nodes=2                    # Number of nodes to use
#SBATCH --ntasks-per-node=2          # Two MPI tasks per node (one per socket)
#SBATCH --cpus-per-task=56           # 56 cores per task, 112 cores per node
#SBATCH --time=0:10:00               # Time limit; adjust as needed
#SBATCH --exclusive                  # Request exclusive access to the nodes

# Load the required modules
module load intel/mpi/2019u12

# Change to the directory containing the LINPACK executable
cd /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack

# Run the LINPACK benchmark using the provided script
bash runme_intel64_dynamic

Original Problem

A few colleagues and I are trying to understand the parameters for Intel's implementation of LINPACK. I have started testing on two nodes, each with two 8480s (56 cores per CPU).

One of my colleagues figured out by trial and error that this works:

mpirun -perhost 2 -np 8 ./runme_intel64_prv -n 117120 -b 384 -p 2 -q 4

What we don't understand is what that means.

  • perhost (same as ppn) is "processes per node"
  • np is "number of processes"

That somehow ends up running 60 threads per node, and I have no idea why or how. Even more baffling, if I change perhost to 8 it produces 120 threads per node. This seems to run counter to the documentation I find, which says perhost is the number of processes per node; I cannot figure out how 8 and 2 get you 60. I got that number by running ps -efT | grep xhpl on a node, which returned 60 instances of:

grant    166328 166328 166325 97 12:23 ?        00:03:01 ./xhpl_intel64_dynamic -n 117120 -b 384 -p 2 -q 4
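
To tell whether those 60 entries are separate processes or threads of a few processes, the `ps -efT` output can be grouped by PID. This is a minimal sketch, assuming the standard `ps -efT` column order (UID PID SPID PPID C STIME TTY TIME CMD) seen in the line above; `threads_per_pid` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Reads `ps -efT` output on stdin and prints "PID thread_count" for every
# process whose executable (column 9) matches the given pattern.
# `ps -efT` prints one line per thread, so counting lines per PID counts threads.
threads_per_pid() {  # arg: pattern to match against the executable name
  awk -v pat="$1" '
    $9 ~ pat { count[$2]++ }                      # $2 is the PID
    END { for (pid in count) print pid, count[pid] }
  ' | sort -n
}

# Usage on a compute node while the benchmark is running:
#   ps -efT | threads_per_pid xhpl
```

Matching on column 9 rather than the whole line keeps the `awk` process itself, whose arguments contain the pattern, out of the count.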

I looked at Intel's docs and they essentially say nothing about this.

I am launching it via SLURM with:

#SBATCH --job-name=linpack
#SBATCH --output=linpack_%j.out
#SBATCH --partition=8480    # Specify your partition
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks-per-node=8          # Number of tasks (MPI processes) per node
#SBATCH --time=1:00:00                # Time limit in the format hours:minutes:seconds

# Load the required modules
module load intel/oneAPI/2023.0.0
module load compiler-rt/2023.0.0 mkl/2023.0.0 mpi/2021.8.0

# Navigate to the directory containing your HPL files
cd /home/grant/mp_linpack

# Run the HPL benchmark
bash runme_intel64_dynamic

I also see some people using NUMA_PER_MPI but I couldn't find a clear answer on what that does if anything.

Full Server Specs

Bottom Line

Given the above server specs, I'm trying to figure out the values of:

  • PxQ (I would have thought something like 14x15 since we want it split across two servers with 224 total cores, but that blows up with a series of errors)
  • What parameters I should pass in to mpirun to fully utilize all CPU cores
  • Some basic explanation of how those mpirun parameters correspond to threads/processes across the cluster


  • I spent a very long time figuring all this out and have written a full write-up here on how Intel MKL works, how the math works, and how to optimize it.

    The TLDR answer to this question:

    Running under a job manager is a bit of a mess, because you basically have to reverse engineer the mpirun binary to see how the job manager's environment variables (SLURM's, in this case) get used. I recommend not running with SLURM or any other job manager. You can do it, but you'll need to look at exactly how MKL uses those environment variables. I talk about this in the guide.

    Here is the snippet of my guide that directly answers this question:

    If you're like me, you Googled LINPACK and how to optimize it. If you are also like me, you landed on a bunch of open source documentation about how to tune it. Forget everything you read, because that's not how the Intel version works.
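
One low-effort starting point for that investigation is to dump the launcher-related environment inside the job and compare it with a run started by hand. This is only a sketch: it assumes the variables of interest carry the `SLURM_` or `I_MPI_` prefix, and `launcher_env` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print launcher-related environment variables, sorted, so the environment of
# a run inside a SLURM allocation can be diffed against a manual run.
# Assumption: the relevant variables start with SLURM_ or I_MPI_.
launcher_env() {
  env | grep -E '^(SLURM_|I_MPI_)' | sort
}

launcher_env
```

Redirecting the output to a file in each setting and running `diff` on the two files shows which variables only exist under the job manager.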

    Inside of runme_intel64_dynamic you will see the following values:

    # Set total number of MPI processes for the HPL (should be equal to PxQ).
    export MPI_PROC_NUM=2
    # Set the MPI per node for each node.
    # MPI_PER_NODE should be equal to 1 or number of sockets on the system.
    # It will be same as -perhost or -ppn paramaters in mpirun/mpiexec.
    export MPI_PER_NODE=2
    # Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI)
    # should be equal to number of NUMA nodes on the system.
    export NUMA_PER_MPI=1

    These are the heart of Intel's benchmark. There is more detail on Intel's entire process flow in the README. Here is what each value does and how to tune it. For the sake of example, let's say your setup is 3 servers, each with 112 cores and 8 NUMA domains apiece.

    • MPI_PROC_NUM: This is the total number of MPI processes that will run cluster wide. In my testing, I found the best performance came from having one MPI process per available NUMA domain. So if each server has 8, you would want 24 MPI processes total.
    • MPI_PER_NODE: This controls how many MPI processes are allocated to each node in round-robin fashion. For instance, if you set this to 4 in our example, MPI assigns the first 4 MPI processes to node 1, the next 4 to node 2, and the next 4 to node 3, then wraps back to node 1 and adds 4 more to each node until all 24 processes are placed. In my testing this did not yield optimal results. It is better to simply set MPI_PER_NODE to the number of NUMA domains per server, so 8 in our case.
    • NUMA_PER_MPI: This dictates how many NUMA domains each MPI process will span. If it is set to 1, each MPI process operates on a single NUMA domain; if it is set to 2, each MPI process spans two NUMA domains. I show what this looks like in gory detail in the runme_intel64_prv section of the README. My best results were with this set to 1, with each MPI process bound to a single NUMA domain.
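
The arithmetic behind those three values can be sketched as follows, using the example cluster of 3 servers with 8 NUMA domains each. `NODES`, `NUMA_PER_NODE`, and `node_for_rank` are illustrative names that do not appear in Intel's script, and the placement function only models the round-robin behavior as described in the bullets, not Intel MPI's actual internals.

```shell
#!/usr/bin/env bash
# Derive the three runme_intel64_dynamic settings from the cluster shape,
# following the "one MPI process per NUMA domain" approach described above.
NODES=3            # illustrative: servers in the cluster
NUMA_PER_NODE=8    # illustrative: NUMA domains on each server

export NUMA_PER_MPI=1                                     # one domain per MPI process
export MPI_PER_NODE=$(( NUMA_PER_NODE / NUMA_PER_MPI ))   # 8 processes on each node
export MPI_PROC_NUM=$(( NODES * MPI_PER_NODE ))           # 24 processes cluster wide

echo "MPI_PROC_NUM=${MPI_PROC_NUM} MPI_PER_NODE=${MPI_PER_NODE} NUMA_PER_MPI=${NUMA_PER_MPI}"

# Round-robin placement as described above: blocks of per_node ranks go to
# node 0, node 1, ... and then wrap around. Nodes are numbered from 0 here.
node_for_rank() {  # args: rank per_node nodes
  echo $(( ($1 / $2) % $3 ))
}

# With 4 processes per node on 3 nodes, rank 12 wraps back to the first node.
node_for_rank 12 4 3
```

On a real system, the NUMA domain count could be read from the "NUMA node(s)" line of `lscpu` instead of being hard-coded.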

    CRITICAL: You must heed the comment "Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI) should be equal to number of NUMA nodes on the system.". Whatever config you test, aim for MPI_PER_NODE * NUMA_PER_MPI = <number of NUMA nodes on a single system>. A config with MPI_PER_NODE * NUMA_PER_MPI < <number of NUMA nodes on a single system> will run, although it is suboptimal. However, if MPI_PER_NODE * NUMA_PER_MPI > <number of NUMA nodes on a single system>, the test will fail with memory errors.
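
That rule is easy to check before submitting a run. A minimal sketch, where `check_numa_layout` is a hypothetical helper and the three outcomes simply restate the cases described above:

```shell
#!/usr/bin/env bash
# Classify a (MPI_PER_NODE, NUMA_PER_MPI) pair against one node's NUMA count.
# Prints: ok     product equals the NUMA node count
#         under  product is smaller (runs, but is suboptimal)
#         over   product is larger (expected to fail with memory errors)
check_numa_layout() {  # args: mpi_per_node numa_per_mpi numa_nodes_on_system
  local product=$(( $1 * $2 ))
  if [ "$product" -eq "$3" ]; then
    echo ok
  elif [ "$product" -lt "$3" ]; then
    echo under
  else
    echo over
  fi
}

check_numa_layout 8 1 8   # one MPI process per NUMA domain
check_numa_layout 2 8 8   # 2 * 8 = 16 exceeds the 8 domains on the node
```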

    Finally, you will need to set p and q. Since Intel spawns threads automatically, p and q for Intel's variant are nothing like they are for regular MPI. This is one part you may want to experiment with. The one key constraint is that $p \times q = \text{MPI\_PROC\_NUM}$. I generally find the best performance when the two numbers are as close to each other as they can be. In our example, this would be $4 \times 6$, or maybe $2 \times 12$.
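
Finding the most nearly square grid for a given process count can be sketched like this. `closest_pq` is a hypothetical helper, and choosing P as the smaller factor is an assumption made here for illustration; treat its output as a starting point for experiments rather than a guaranteed optimum.

```shell
#!/usr/bin/env bash
# Print the factor pair "P Q" of the process count with P <= Q and P as large
# as possible, i.e. the most nearly square process grid.
closest_pq() {  # arg: MPI_PROC_NUM
  local n="$1" p=1 i=1
  while [ $(( i * i )) -le "$n" ]; do
    if [ $(( n % i )) -eq 0 ]; then
      p=$i                      # i divides n; remember the largest such i
    fi
    i=$(( i + 1 ))
  done
  echo "$p $(( n / p ))"
}

closest_pq 24   # the 3-server example: prints "4 6"
```

The two numbers can then be passed as the `-p` and `-q` arguments or written to the Ps and Qs lines of HPL.dat.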