I am writing an MPI application to speed up a math algorithm on an ARM-based device. The device has an Amlogic S922X SoC, which integrates a quad-core ARM Cortex-A73 CPU and a dual-core Cortex-A53 CPU.
I am wondering: by tuning compiler options, or by selecting a different compiler, can I expect a further speedup for my application?
I have been experimenting with options of the mpic++ compiler wrapper, such as -O1, -O3, -Ofast, -ffast-math, -march=native, etc.
The final set of flags was: -Wall -Wextra -std=c++11 -Ofast
The built application runs on both core types. However, the two core types have different instruction sets, so I think the binary is not yet tuned for maximum performance.
The capabilities of the two cores are described in the datasheet:
Cortex-A53 processor features
Cortex-A73 processor features
How can I use the more powerful features of the A73 cores to further speed up my application? What is the best approach?
By the way, from my previous post I learned that I must use the big cores if I want maximum performance:
Your problem is twofold.
First, there are cores with varying instruction sets. Most MPI implementations provide an easy solution for that by allowing you to run jobs built from more than one executable. You simply need to compile the code twice with core-specific optimisations in order to produce two executable files. Let's call them prog.big (optimised for the big cores) and prog.little (optimised for the LITTLE cores). Then, instead of launching 6 ranks from a generic executable with mpiexec -n 6 ./prog, you launch 4 ranks from prog.big and 2 ranks from prog.little:
mpiexec -n 4 ./prog.big : -n 2 ./prog.little
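For example, the two core-specific builds might be produced like this (a sketch, assuming GCC behind the mpic++ wrapper and a hypothetical source file prog.cpp; check your toolchain's documentation for the exact -mcpu values):

mpic++ -Wall -Wextra -std=c++11 -Ofast -mcpu=cortex-a73 prog.cpp -o prog.big
mpic++ -Wall -Wextra -std=c++11 -Ofast -mcpu=cortex-a53 prog.cpp -o prog.little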
That's not enough though. You need to place the right process on the right core. Doing so is very implementation-specific. In the simplest case, you can tell MPI to pin/bind each MPI rank to a single logical CPU and do so in a linear fashion, i.e., rank 0 gets bound to core 0, rank 1 to core 1, etc., and hope that the OS maps the big cores to logical CPUs 0 to 3 and the LITTLE cores to logical CPUs 4 and 5. If that is not the case, you may need to perform some additional acrobatics. For example, Open MPI allows you to specify a rankfile with --rankfile filename, in which you can provide a rank-to-CPU mapping:
rank 0=localhost slot=0
rank 1=localhost slot=1
rank 2=localhost slot=2
...
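Putting both pieces together, the full launch could then look like this (a sketch; ranks.txt is a hypothetical rankfile name, and you should verify with a tool such as lscpu which slot numbers correspond to the big and LITTLE cores on your system):

mpiexec --rankfile ranks.txt -n 4 ./prog.big : -n 2 ./prog.little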
Having optimised executable files and properly placed processes is only half of the solution. The rest is to actually have a parallel algorithm that can make use of CPUs with different speeds. If you have a globally synchronous algorithm, for example one solving PDEs, or anything iterative in general, then the computation time of a single step is that of the slowest MPI rank. If you give the same amount of work to the big and to the LITTLE cores, the latter will lag significantly and the former will have to wait, wasting computational time.

So you need to either perform some advanced domain decomposition and give smaller work items to the slower cores, or use an approach such as "bag of work" (a.k.a. controller/worker) and have each worker rank request a piece of data to work on, as sketched below. In this case, faster cores will process more items and the work will balance itself automatically.
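Here is a minimal sketch of such a controller/worker scheme in MPI (the item type, the process_item function, and the tag values are illustrative placeholders; a real application would also collect the results on the controller):

#include <mpi.h>
#include <vector>
#include <cmath>

// Placeholder for the actual computation performed on one work item.
double process_item(double item) {
    return std::sqrt(item);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int TAG_WORK = 1, TAG_STOP = 2, TAG_RESULT = 3;

    if (rank == 0) {
        // Controller: hand out items on demand, then send stop signals.
        std::vector<double> items(1000);
        for (std::size_t i = 0; i < items.size(); ++i) items[i] = i;

        std::size_t next = 0;
        int active = size - 1;
        while (active > 0) {
            double result;
            MPI_Status status;
            // Each incoming result doubles as a request for more work
            // (the very first message from each worker is a dummy result).
            MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, TAG_RESULT,
                     MPI_COMM_WORLD, &status);
            if (next < items.size()) {
                MPI_Send(&items[next], 1, MPI_DOUBLE, status.MPI_SOURCE,
                         TAG_WORK, MPI_COMM_WORLD);
                ++next;
            } else {
                double dummy = 0.0;
                MPI_Send(&dummy, 1, MPI_DOUBLE, status.MPI_SOURCE,
                         TAG_STOP, MPI_COMM_WORLD);
                --active;
            }
        }
    } else {
        // Worker: report in, then keep processing items until told to stop.
        // Faster (big) cores simply come back for new items more often.
        double result = 0.0;
        while (true) {
            MPI_Send(&result, 1, MPI_DOUBLE, 0, TAG_RESULT, MPI_COMM_WORLD);
            double item;
            MPI_Status status;
            MPI_Recv(&item, 1, MPI_DOUBLE, 0, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            if (status.MPI_TAG == TAG_STOP) break;
            result = process_item(item);
        }
    }

    MPI_Finalize();
    return 0;
}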