
Understanding Parameters for Intel MKL LINPACK w/MPI `ppn` and `np`


Update 3

  • Changed NUMA_PER_MPI to 4
  • Changed P and Q to 1 and 2.

While the above config does launch and run, it is significantly outperformed by setting

  • MPI_PROC_NUM=8
  • MPI_PER_NODE=2
  • NUMA_PER_MPI=1
  • P&Q to 2&4

which we do not understand.

Update 2

I discovered from this post that Intel's MP_LINPACK does not use OpenMP, which explains why the OpenMP settings have no effect.

Update

Did some studying and I now understand Intel is running OpenMP under the hood. However, there are still several inconsistencies I don't understand.

I have updated everything to use the config below. As Intel's comments indicate, I have set MPI_PER_NODE to 2 since each of my systems has two sockets, and then I adjusted MPI_PROC_NUM to match.

The config below blows up with this error. Each host runs 116 threads (not sure why, but it always seems to have four extra threads):

HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
HPL[   1, z5-01] Failed memory mapping : NodeMask =
HPL[   3, z5-02] Failed memory mapping : NodeMask =
[z5-01:111292:0:111292] Caught signal 8 (Floating point exception: integer divide by zero)
[z5-02:93814:0:93814] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:  93814) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x0000000000024531 ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 2 0x00000000000150ea ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 3 0x000000000003e935 ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 4 0x000000000010cd0d ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 5 0x000000000003ad85 __libc_start_main()  ???:0
 6 0x000000000000d92e ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
=================================
==== backtrace (tid: 111292) ====
 0 0x0000000000012cf0 __funlockfile()  :0
 1 0x00000000000244c2 ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 2 0x00000000000150ea ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 3 0x000000000003e935 ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 4 0x000000000010cd0d ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
 5 0x000000000003ad85 __libc_start_main()  ???:0
 6 0x000000000000d92e ???()  /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack/xhpl_intel64_dynamic:0
=================================
./runme_intel64_prv: line 70: 93814 Floating point exception(core dumped) ./${HPL_EXE} "$@"
./runme_intel64_prv: line 70: 111292 Floating point exception(core dumped) ./${HPL_EXE} "$@"

In response to that, I have tried:

  • Limiting the threads with OMP_NUM_THREADS, which appears to be completely ignored for reasons that aren't clear to me.
  • Limiting the threads with HPL_NUMTHREADS, which does limit the threads but doesn't fix the error above.
  • Running with the exports below, but as far as I can tell that did nothing and the same error persisted.
export MKL_NUM_THREADS=4
export MKL_DOMAIN_NUM_THREADS="MKL_BLAS=4"
export OMP_NUM_THREADS=1
export MKL_DYNAMIC="FALSE"
export OMP_DYNAMIC="FALSE"

Updated Run File

#!/bin/bash
#===============================================================================
# Copyright 2001-2023 Intel Corporation.
#
# This software and the related documents are Intel copyrighted  materials,  and
# your use of  them is  governed by the  express license  under which  they were
# provided to you (License).  Unless the License provides otherwise, you may not
# use, modify, copy, publish, distribute,  disclose or transmit this software or
# the related documents without Intel's prior written permission.
#
# This software and the related documents  are provided as  is,  with no express
# or implied  warranties,  other  than those  that are  expressly stated  in the
# License.
#===============================================================================

# Set total number of MPI processes for the HPL (should be equal to PxQ).
export MPI_PROC_NUM=4

# Set the MPI per node for each node.
# MPI_PER_NODE should be equal to 1 or number of sockets on the system.
# It will be same as -perhost or -ppn paramaters in mpirun/mpiexec.
export MPI_PER_NODE=2

# Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI)
# should be equal to number of NUMA nodes on the system.
export NUMA_PER_MPI=8

#====================================================================
# Following option is for Intel(R) Optimized HPL-AI Benchmark
#====================================================================

# Comment in to enable Intel(R) Optimized HPL-AI Benchmark
# export USE_HPL_AI=1

#====================================================================
# Following option is for Intel(R) Optimized HPL-AI Benchmark for GPU
#====================================================================

# By default, Intel(R) Optimized HPL-AI Benchmark for GPU will use
# Bfloat16 matrix. If you prefer less iterations, you could choose
# float based matrix. But it will reduce maximum problem size. 
# export USE_BF16MAT=0

#====================================================================
# Following options are for Intel(R) Distribution for LINPACK
# Benchmark for GPU and Intel(R) Optimized HPL-AI Benchmark for GPU
#====================================================================

# Comment in to enable GPUs
# export USE_HPL_GPU=1

# Select backend driver for GPU (OpenCL ... 0, Level Zero ... 1)
# export HPL_DRIVER=0

# Number of stacks on each GPU
# export HPL_NUMSTACK=2

# Total number of GPUs on each node
# export HPL_NUMDEV=2

#====================================================================

export OUT=xhpl_intel64_dynamic_outputs.txt

if [ -z ${USE_HPL_AI} ]; then
if [ -z ${USE_HPL_GPU} ]; then
export HPL_EXE=xhpl_intel64_dynamic
else
export HPL_EXE=xhpl_intel64_dynamic_gpu
fi
else
if [ -z ${USE_HPL_GPU} ]; then
export HPL_EXE=xhpl-ai_intel64_dynamic
else
export HPL_EXE=xhpl-ai_intel64_dynamic_gpu
fi
fi

echo -n "This run was done on: "
date

# Capture some meaningful data for future reference:
echo -n "This run was done on: " >> $OUT
date >> $OUT
echo "HPL.dat: " >> $OUT
cat HPL.dat >> $OUT
echo "Binary name: " >> $OUT
ls -l ${HPL_EXE} >> $OUT
echo "This script: " >> $OUT
cat runme_intel64_dynamic >> $OUT
echo "Environment variables: " >> $OUT
env >> $OUT
echo "Actual run: " >> $OUT

# Environment variables can also be also be set on the Intel(R) MPI Library command
# line using the -genv option (to appear before the -np 1):

mpirun -perhost ${MPI_PER_NODE} -np ${MPI_PROC_NUM} ./runme_intel64_prv "$@" | tee -a $OUT

echo -n "Done: " >> $OUT
date >> $OUT

echo -n "Done: "
date

Updated HPL.dat

This is based on: https://researchcomputing.princeton.edu/faq/how-to-use-openmpi-with-o

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
249984         Ns (this is an example; adjust based on the calculation above)
1            # of NBs
384          NBs (a common choice, but you might experiment with this)
1            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps (you could also try 8 for a different P x Q configuration)
2            Qs (correspondingly, this would be 4 if you chose P as 8)
1.0          threshold
1            # of panel fact
2 1 0        PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
2            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1 0 2        RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
0            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
0            DEPTHs (>=0)
0            SWAP (0=bin-exch,1=long,2=mix)
1            swapping threshold
1            L1 in (0=transposed,1=no-transposed) form
1            U  in (0=transposed,1=no-transposed) form
0            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
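
For reference, a widely used rule of thumb for picking Ns is to size the N x N matrix of 8-byte doubles to fill a fixed fraction of total cluster memory, then round N down to a multiple of NB. The sketch below assumes that rule. The function name `hpl_problem_size` and the 0.8 fraction are illustrative assumptions, not values taken from Intel's scripts.

```shell
# Hypothetical helper: estimate an HPL problem size N.
# Arguments: memory per node (GiB), node count, block size NB, memory fraction.
hpl_problem_size() {
  mem_gib=$1; nodes=$2; nb=$3; fraction=$4
  awk -v m="$mem_gib" -v k="$nodes" -v nb="$nb" -v f="$fraction" 'BEGIN {
    bytes = m * k * 1024 * 1024 * 1024   # total memory across the cluster
    n = sqrt(f * bytes / 8)              # N*N doubles, 8 bytes each
    print int(n / nb) * nb               # round down to a multiple of NB
  }'
}

# Example for the two-node system described in this post
# (503 GiB per node, NB=384), assuming 80% of memory is usable:
hpl_problem_size 503 2 384 0.8
```

Leaving some headroom matters: if the matrix does not fit in RAM, the run will swap or fail, so it is safer to start with a smaller fraction and work upward.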

Updated SLURM Job File

#!/bin/bash
#SBATCH --job-name=linpack
#SBATCH --output=linpack_%j.out
#SBATCH --partition=8480             # Specify the partition
#SBATCH --nodes=2                    # Number of nodes to use
#SBATCH --ntasks-per-node=2       # Use all 112 cores on each node
#SBATCH --cpus-per-task=56
#SBATCH --time=0:10:00               # Time limit; adjust as needed
#SBATCH --exclusive                  # Request exclusive access to the nodes

# Load the required modules
module load intel/mpi/2019u12

# Change to the directory containing the LINPACK executable
cd /home/grant/benchmarks_2024.0/linux/share/mkl/benchmarks/mp_linpack

# Run the LINPACK benchmark using the provided script
bash runme_intel64_dynamic

Original Problem


A few colleagues and I are trying to understand the parameters for Intel's implementation of LINPACK. I have started testing on two nodes, each with two 8480s (56 cores per socket).

One of my colleagues figured out by trial and error that this works:

mpirun -perhost 2 -np 8 ./runme_intel64_prv -n 117120 -b 384 -p 2 -q 4

What we don't understand is what that means.

  • perhost (same as ppn) is "processes per node"
  • np is "number of processes"

That somehow ends up running 60 threads per node, and I haven't the vaguest idea why or how. Even more baffling, if I change perhost to 8, it produces 120 threads per node. This seems to run counter to the documentation, which says perhost is the number of processes per node; I cannot figure out how 8 and 2 get you 60. I got that number by running ps -efT | grep xhpl on a node, which returned 60 instances of:

grant    166328 166328 166325 97 12:23 ?        00:03:01 ./xhpl_intel64_dynamic -n 117120 -b 384 -p 2 -q 4
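
Since ps -efT prints one line per thread, it is hard to tell from that output how many processes those threads belong to. A minimal sketch for separating the two, assuming the procps version of ps (which supports the nlwp column, the thread count per process); the helper name `summarize_threads` is made up for illustration:

```shell
# Hypothetical helper: read "PID NLWP ..." lines on stdin and report
# the number of processes and the total number of threads.
summarize_threads() {
  awk '{ procs++; threads += $2 }
       END { printf "%d processes, %d threads\n", procs, threads }'
}

# On a compute node while the benchmark is running
# (the [x] keeps grep from matching its own command line):
#   ps -eo pid=,nlwp=,args= | grep '[x]hpl_intel64_dynamic' | summarize_threads
```

Dividing the thread total by the process count shows how many threads each MPI process spawned, which is the number to compare against the cores per NUMA domain.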

I looked at Intel's docs and they essentially say nothing about this.

I am launching it via SLURM with:

#!/bin/bash
#SBATCH --job-name=linpack
#SBATCH --output=linpack_%j.out
#SBATCH --partition=8480    # Specify your partition
#SBATCH --nodes=2                     # Number of nodes
#SBATCH --ntasks-per-node=8          # Number of tasks (MPI processes) per node
#SBATCH --time=1:00:00                # Time limit in the format hours:minutes:seconds

# Load the required modules
module load intel/oneAPI/2023.0.0
module load compiler-rt/2023.0.0 mkl/2023.0.0 mpi/2021.8.0

# Navigate to the directory containing your HPL files
cd /home/grant/mp_linpack

# Run the HPL benchmark
bash runme_intel64_dynamic

I also see some people using NUMA_PER_MPI, but I couldn't find a clear answer on what it does, if anything.

Full Server Specs

Gathering system information for HPL configuration...
Memory Information:
              total        used        free      shared  buff/cache   available
Mem:          503Gi       2.9Gi       498Gi       1.8Gi       2.2Gi       496Gi
Swap:          11Gi          0B        11Gi

CPU Information:
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              112
On-line CPU(s) list: 0-111
Thread(s) per core:  1
Core(s) per socket:  56
Socket(s):           2
NUMA node(s):        8
Vendor ID:           GenuineIntel
CPU family:          6
Model:               143
Model name:          Intel(R) Xeon(R) Platinum 8480+
Stepping:            6
CPU MHz:             2000.000
BogoMIPS:            4000.00
L1d cache:           48K
L1i cache:           32K
L2 cache:            2048K
L3 cache:            107520K
NUMA node0 CPU(s):   0-13
NUMA node1 CPU(s):   14-27
NUMA node2 CPU(s):   28-41
NUMA node3 CPU(s):   42-55
NUMA node4 CPU(s):   56-69
NUMA node5 CPU(s):   70-83
NUMA node6 CPU(s):   84-97
NUMA node7 CPU(s):   98-111
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pd
pe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds
_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowpr
efetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invp
cid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_ll
c cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts avx512vbmi umip pku ospke waitpkg avx5
12_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear se
rialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

NUMA Nodes Information:
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
node 0 size: 63920 MB
node 0 free: 63082 MB
node 1 cpus: 14 15 16 17 18 19 20 21 22 23 24 25 26 27
node 1 size: 64508 MB
node 1 free: 64118 MB
node 2 cpus: 28 29 30 31 32 33 34 35 36 37 38 39 40 41
node 2 size: 64508 MB
node 2 free: 64272 MB
node 3 cpus: 42 43 44 45 46 47 48 49 50 51 52 53 54 55
node 3 size: 64508 MB
node 3 free: 63319 MB
node 4 cpus: 56 57 58 59 60 61 62 63 64 65 66 67 68 69
node 4 size: 64508 MB
node 4 free: 64133 MB
node 5 cpus: 70 71 72 73 74 75 76 77 78 79 80 81 82 83
node 5 size: 64508 MB
node 5 free: 64139 MB
node 6 cpus: 84 85 86 87 88 89 90 91 92 93 94 95 96 97
node 6 size: 64508 MB
node 6 free: 63784 MB
node 7 cpus: 98 99 100 101 102 103 104 105 106 107 108 109 110 111
node 7 size: 64505 MB
node 7 free: 63377 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  12  12  12  21  21  21  21 
  1:  12  10  12  12  21  21  21  21 
  2:  12  12  10  12  21  21  21  21 
  3:  12  12  12  10  21  21  21  21 
  4:  21  21  21  21  10  12  12  12 
  5:  21  21  21  21  12  10  12  12 
  6:  21  21  21  21  12  12  10  12 
  7:  21  21  21  21  12  12  12  10 

Hyper-Threading Check:
Hyper-Threading is Disabled

Total Possible MPI Processes (Logical Cores):
CPU(s):              112

System Architecture:
Architecture:        x86_64
CPU Model:
Model name:          Intel(R) Xeon(R) Platinum 8480+

System information gathering complete.

Bottom Line

Given the above server specs, I'm trying to figure out the values of:

  • PxQ (I would have thought something like 14x15 since we want it split across two servers with 224 total cores, but that blows up with a series of errors)
  • What parameters I should pass in to mpirun to fully utilize all CPU cores
  • Some basic explanation of how those mpirun parameters correspond to threads/processes across the cluster

Solution

  • I spent a very long time figuring all this out and have written a full write-up here on how Intel MKL works, how the math works, and how to optimize it.

    The TLDR answer to this question:

    Running under a job manager is a bit of a mess because you basically have to reverse engineer the mpirun binary to see how the job manager's environment variables (SLURM's, in this case) get used. I recommend not running under SLURM or any other job manager. You can do it, but you'll need to look at exactly how MKL is using those environment variables. I talk about this in the guide.

    Here is the snippet of my guide that directly answers this question:


    If you're like me, you Googled LINPACK and how to optimize it. If you are also like me, you landed on a bunch of open source documentation about how to tune it. Forget everything you read, because that's not how the Intel version works.

    Inside of runme_intel64_dynamic you will see the following values:

    # Set total number of MPI processes for the HPL (should be equal to PxQ).
    export MPI_PROC_NUM=2
    
    # Set the MPI per node for each node.
    # MPI_PER_NODE should be equal to 1 or number of sockets on the system.
    # It will be same as -perhost or -ppn paramaters in mpirun/mpiexec.
    export MPI_PER_NODE=2
    
    # Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI)
    # should be equal to number of NUMA nodes on the system.
    export NUMA_PER_MPI=1
    

    These are the heart of Intel's benchmark. There is more detail on Intel's entire process flow in the README. Here is what each value does and how to tune it. For the sake of example, let's say your setup is 3 servers with 112 cores and 8 NUMA domains apiece.

    • MPI_PROC_NUM: The total number of MPI processes that will run cluster-wide. In my testing, the best performance came from one MPI process per available NUMA domain. So if each server has 8, you would want 24 MPI processes total.
    • MPI_PER_NODE: This controls how many MPI processes are allocated to each node at a time, in round-robin fashion. For example, if you set this to 4 in our example, MPI assigns the first 4 MPI processes to node 1, the next 4 to node 2, and another 4 to node 3, then wraps back to node 1 and adds 4 more to each node to reach the full 24 processes. In my testing this did not yield optimal results. It is better to simply set MPI_PER_NODE to the number of NUMA domains per server, so 8 in our case.
    • NUMA_PER_MPI: This dictates how many NUMA domains each MPI process will span. If it is set to 1, each MPI process operates on a single NUMA domain; if it is set to 2, each MPI process spans two NUMA domains. I show what this looks like in gory detail in the runme_intel64_prv section of the README. My best results were with this set to 1, so that each MPI process binds to a single NUMA domain.
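
To make the round-robin placement concrete, here is a small sketch that simulates it. It is only a model of the behavior described above, not something taken from Intel's scripts, and `ranks_per_node` is a hypothetical name: ranks are dealt out in consecutive blocks of MPI_PER_NODE, cycling over the nodes.

```shell
# Hypothetical helper: count how many MPI ranks land on each node when
# `np` ranks are dealt out in blocks of `perhost`, cycling over `nodes` hosts.
ranks_per_node() {
  np=$1; perhost=$2; nodes=$3
  node=0
  while [ "$node" -lt "$nodes" ]; do
    count=0
    rank=0
    while [ "$rank" -lt "$np" ]; do
      # Block number (rank / perhost) modulo the node count gives the host.
      if [ $(( (rank / perhost) % nodes )) -eq "$node" ]; then
        count=$((count + 1))
      fi
      rank=$((rank + 1))
    done
    echo "node $((node + 1)): $count ranks"
    node=$((node + 1))
  done
}

# The three-server example: 24 ranks, 4 per block, 3 nodes.
ranks_per_node 24 4 3
```

Under this model, the command from the question (-perhost 2 -np 8 on two nodes) puts 4 ranks on each node, while -perhost 8 -np 8 puts all 8 ranks on the first node and none on the second.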

    CRITICAL: You must heed the comment "Set the number of NUMA nodes per MPI. (MPI_PER_NODE * NUMA_PER_MPI) should be equal to number of NUMA nodes on the system.". Whatever config you test, MPI_PER_NODE * NUMA_PER_MPI should equal the number of NUMA nodes on a single system. A product smaller than the number of NUMA nodes is suboptimal, but that config will run. A product larger than the number of NUMA nodes will make the test fail with memory errors.
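
A quick pre-flight check for this rule can be sketched as follows. The function name `check_numa_layout` is hypothetical, and the lscpu parsing in the comment assumes the "NUMA node(s):" line format shown in the server specs in the question.

```shell
# Hypothetical helper: compare MPI_PER_NODE * NUMA_PER_MPI against the
# number of NUMA nodes on one server before launching the benchmark.
check_numa_layout() {
  mpi_per_node=$1; numa_per_mpi=$2; numa_nodes=$3
  product=$((mpi_per_node * numa_per_mpi))
  if [ "$product" -gt "$numa_nodes" ]; then
    echo "invalid: $product exceeds $numa_nodes NUMA nodes"
    return 1
  elif [ "$product" -lt "$numa_nodes" ]; then
    echo "suboptimal: $product of $numa_nodes NUMA nodes used"
  else
    echo "ok: all $numa_nodes NUMA nodes used"
  fi
}

# One way to read the NUMA node count (assumes lscpu prints "NUMA node(s): 8"):
#   numa_nodes=$(lscpu | awk -F: '/^NUMA node\(s\)/ { gsub(/ /, "", $2); print $2 }')
check_numa_layout 8 1 8
```

The failing config from the question (MPI_PER_NODE=2, NUMA_PER_MPI=8 on a server with 8 NUMA nodes) gives a product of 16, so this check would flag it as invalid.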

    Finally, you will need to set p and q. Since Intel spawns threads automatically, p and q for Intel's variant are nothing like they are for regular MPI. This is one part with which you may want to experiment. The one key is that $p \times q = \text{MPI\_PROC\_NUM}$. I generally find the best performance when the two numbers are as close to each other as they can be. In our example, this would be $4 \times 6$ or maybe $2 \times 12$.
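
As a starting point for that experiment, the most nearly square grid for a given MPI_PROC_NUM can be found by searching for the largest divisor that does not exceed its square root. This is a minimal sketch with a hypothetical function name, `closest_grid`; it only suggests a first guess, and other factor pairs are still worth benchmarking.

```shell
# Hypothetical helper: print "P Q" with P <= Q, P * Q = n, and P as
# close to Q as possible.
closest_grid() {
  n=$1
  p=1
  i=1
  while [ $((i * i)) -le "$n" ]; do
    # Keep the largest divisor found at or below sqrt(n).
    if [ $((n % i)) -eq 0 ]; then
      p=$i
    fi
    i=$((i + 1))
  done
  echo "$p $((n / p))"
}

closest_grid 24   # prints "4 6"
```

For the 8-process configuration from the question, the same function prints "2 4", which matches the P and Q values that performed best there.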