Tags: c++, performance, x86, compiler-optimization, simd

How to achieve peak flop throughput for FMA when using input data (while maintaining the required roofline compute/load ratio)?


I am trying to achieve peak float throughput for SIMD FMA computations while loading the input data. I load only as much data as the relative compute / memory-load speeds permit, and I added double buffering to avoid immediate data dependencies, but that didn't help. Unfortunately, the throughput is off by a factor of two.

Consider the following FMA function:

 std::pair<float, float> fmadd_256(const float* aptr, const float* bptr,
                                  uint64_t sz, uint32_t num_loads) {
    constexpr size_t num_lanes = 256 / (sizeof(float) * 8);
    __m256 c[8] = {};  // accumulators, zero-initialized
    __m256 a[2][num_loads];
    __m256 b[2][num_loads];
    load_buffer(aptr, bptr, a[0], b[0], 0);
    for (uint64_t i = 0; i < sz; i++) {
        uint64_t curr_buffer = i % 2;
        __m256* curra = a[curr_buffer];
        __m256* currb = b[curr_buffer];
        if (i < sz - 1) {
            uint64_t next_buffer = (i + 1) % 2;
            __m256* nexta = a[next_buffer];
            __m256* nextb = b[next_buffer];
            uint64_t load_position = (i + 1) * num_lanes * num_loads;
            load_buffer(aptr, bptr, nexta, nextb, load_position);
        }
        c[0] = _mm256_fmadd_ps(curra[0], currb[0], c[0]);
        c[1] = _mm256_fmadd_ps(curra[0], currb[0], c[1]);
        c[2] = _mm256_fmadd_ps(curra[0], currb[0], c[2]);
        c[3] = _mm256_fmadd_ps(curra[0], currb[0], c[3]);
        c[4] = _mm256_fmadd_ps(curra[1], currb[1], c[4]);
        c[5] = _mm256_fmadd_ps(curra[1], currb[1], c[5]);
        c[6] = _mm256_fmadd_ps(curra[1], currb[1], c[6]);
        c[7] = _mm256_fmadd_ps(curra[1], currb[1], c[7]);
    }
    constexpr size_t num_unrolls = 8;
    uint64_t flops{num_unrolls * num_lanes * 2 * sz};
    float res = epilogue(c);
    return {res, flops};
}

When running the code above, I get about 60 GFLOPS/sec, but my computer can achieve 130 GFLOPS/sec. When I disable the memory loads, the code achieves 120 GFLOPS/sec. I am using an Alder Lake processor. According to Intel Advisor, the necessary compute/memory-load ratio to achieve peak throughput is 4.4 when the DRAM speed is taken into consideration.

What could I do to improve the throughput of the program?

Interestingly, double buffering does not improve the program's performance. When I load data into the buffer that is currently in use (setting uint64_t next_buffer = i % 2), the code runs as fast as when I cycle over the two buffers. I could also use aligned instead of unaligned loads, or rewrite the code to avoid the branch inside the loop, but I doubt either of those is the root cause of the relatively slow program.

The code is compiled with gcc 11.4.0 and the flags -O2 -march=alderlake -g -ffast-math -Wall -Wextra. I make sure to run the program on one core. The theoretical maximum I compute is 4.9 GHz * 2 FLOP per FMA * 2 FMA units * 8 lanes ≈ 156 GFLOPS/sec. Intel Advisor shows me a maximum compute throughput of roughly 130 GFLOPS/sec.
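
As a sanity check, here is the arithmetic behind that peak number together with the FLOP-per-byte intensity of one loop iteration of fmadd_256 (a back-of-the-envelope sketch; that Advisor's 4.4 ratio is expressed in FLOP per byte is my assumption):

#include <cstdio>

int main() {
    // Peak: 4.9 GHz * 2 FMA units * 2 FLOP per lane per FMA * 8 lanes
    constexpr double peak_gflops = 4.9 * 2 * 2 * 8;              // 156.8

    // One iteration of fmadd_256 with num_loads == 2:
    // 8 FMAs * 8 lanes * 2 FLOP = 128 FLOP, 4 loads * 32 bytes = 128 bytes
    constexpr double flop_per_iter  = 8 * 8 * 2;
    constexpr double bytes_per_iter = 4 * 32;

    std::printf("peak ~ %.1f GFLOPS/sec, intensity = %.2f FLOP/byte\n",
                peak_gflops, flop_per_iter / bytes_per_iter);
    return 0;
}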

Edit based on the Comments

The compiler's behavior

As noted in the comments below, and after inspecting the generated code on godbolt, the compiler produces a lot of unnecessary work. To me, the biggest obstacles to efficient code are the large number of memory operations and the small set of registers the compiler actually uses.

  • I count 12 vmovups operations per loop iteration, but four seem sufficient.
  • The compiler uses only a few registers and, more problematically, uses the same registers both to load the buffered data and to hold the source operands for the FMA ops.

I would expect the main loop the compiler generates to look roughly like the following (an intrinsics sketch of this schedule follows the list):

  • Use registers 0-7 as destination registers for the FMA ops
  • When i % 2 == 0:
    • Prefetch data into registers 8-11
    • Use registers 12-15 as source registers for the FMA operation
  • When i % 2 == 1:
    • Prefetch data into registers 12-15
    • Use registers 8-11 as source registers
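
Below is a software-pipelined sketch of that schedule written by hand: two named sets of __m256 locals take turns, so while one set feeds the FMAs the other is being filled for the next iteration. It assumes num_loads == 2 and an even sz, reuses load_buffer() and epilogue() from the full program below, and I have not benchmarked it:

float fmadd_256_pipelined(const float* aptr, const float* bptr, uint64_t sz) {
    constexpr uint64_t stride = 16;  // num_lanes (8) * num_loads (2) floats per iteration
    __m256 c[8] = {};                // accumulators, zero-initialized
    __m256 setA_a[2], setA_b[2];     // "registers 12-15" in the schedule above
    __m256 setB_a[2], setB_b[2];     // "registers 8-11"
    load_buffer(aptr, bptr, setA_a, setA_b, 0);  // data for iteration 0
    for (uint64_t i = 0; i < sz; i += 2) {
        // even iteration: fill set B for i + 1, compute from set A
        load_buffer(aptr, bptr, setB_a, setB_b, (i + 1) * stride);
        c[0] = _mm256_fmadd_ps(setA_a[0], setA_b[0], c[0]);
        c[1] = _mm256_fmadd_ps(setA_a[0], setA_b[0], c[1]);
        c[2] = _mm256_fmadd_ps(setA_a[0], setA_b[0], c[2]);
        c[3] = _mm256_fmadd_ps(setA_a[0], setA_b[0], c[3]);
        c[4] = _mm256_fmadd_ps(setA_a[1], setA_b[1], c[4]);
        c[5] = _mm256_fmadd_ps(setA_a[1], setA_b[1], c[5]);
        c[6] = _mm256_fmadd_ps(setA_a[1], setA_b[1], c[6]);
        c[7] = _mm256_fmadd_ps(setA_a[1], setA_b[1], c[7]);
        // odd iteration: fill set A for i + 2 (if there is one), compute from set B
        if (i + 2 < sz)
            load_buffer(aptr, bptr, setA_a, setA_b, (i + 2) * stride);
        c[0] = _mm256_fmadd_ps(setB_a[0], setB_b[0], c[0]);
        c[1] = _mm256_fmadd_ps(setB_a[0], setB_b[0], c[1]);
        c[2] = _mm256_fmadd_ps(setB_a[0], setB_b[0], c[2]);
        c[3] = _mm256_fmadd_ps(setB_a[0], setB_b[0], c[3]);
        c[4] = _mm256_fmadd_ps(setB_a[1], setB_b[1], c[4]);
        c[5] = _mm256_fmadd_ps(setB_a[1], setB_b[1], c[5]);
        c[6] = _mm256_fmadd_ps(setB_a[1], setB_b[1], c[6]);
        c[7] = _mm256_fmadd_ps(setB_a[1], setB_b[1], c[7]);
    }
    return epilogue(c);
}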

Attempts tried that didn't work

Based on the comments, I tried the following:

  • Removing the branch inside the loop by running one fewer loop iteration and hoisting the corresponding code before and after the loop
  • More aggressive optimization (-O3) and a different compiler (clang 17.0.1)

Bugs in the original program

In the original program, I wrote num_lanes = 256 / sizeof(float); but it should be num_lanes = 256 / (sizeof(float) * 8);
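
For concreteness (assuming sizeof(float) == 4, as on x86):

static_assert(256 / sizeof(float) == 64);        // bits divided by bytes: not the lane count
static_assert(256 / (sizeof(float) * 8) == 8);   // 256 bits / 32 bits per float = 8 lanes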

Full program

The full program for the test is below:

#include <immintrin.h>
#include <smmintrin.h>

#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <iostream>
#include <utility>
#include <vector>

inline float hadd(__m256 v8) {
    __m128 v = _mm256_extractf128_ps(v8, 1);
    v = _mm_add_ps(v, _mm256_castps256_ps128(v8));
    v = _mm_add_ps(v, _mm_movehl_ps(v, v));
    v = _mm_add_ss(v, _mm_movehdup_ps(v));
    return _mm_cvtss_f32(v);
}

float epilogue(__m256 c[8]) {
    c[0] = _mm256_add_ps(c[0], c[1]);
    c[2] = _mm256_add_ps(c[2], c[3]);

    c[4] = _mm256_add_ps(c[4], c[5]);
    c[6] = _mm256_add_ps(c[6], c[7]);

    c[0] = _mm256_add_ps(c[0], c[2]);
    c[4] = _mm256_add_ps(c[4], c[6]);

    c[0] = _mm256_add_ps(c[0], c[4]);
    float res = hadd(c[0]);
    return res;
}

template <typename T>
void reporting(T duration, double flops, float res) {
    float gflops_sec = (flops * 1000.0) / duration;
    std::cout << "The inner product is: " << res << std::endl;
    std::cout << "GFLOPS/sec: " << gflops_sec << std::endl;
    std::cout << "total gflops: " << flops << std::endl;
    std::cout << "Duration: " << duration << std::endl;
}

void load_buffer(const float* aptr, const float* bptr, __m256* a, __m256* b,
                 uint64_t idx) {
    a[0] = _mm256_loadu_ps(aptr + idx);
    a[1] = _mm256_loadu_ps(aptr + idx + 8);
    b[0] = _mm256_loadu_ps(bptr + idx);
    b[1] = _mm256_loadu_ps(bptr + idx + 8);
}
std::pair<float, float> fmadd_256(const float* aptr, const float* bptr,
                                  uint64_t sz, uint32_t num_loads) {
    constexpr size_t num_lanes = 256 / (sizeof(float) * 8);
    __m256 c[8] = {};  // accumulators, zero-initialized
    __m256 a[2][num_loads];
    __m256 b[2][num_loads];
    load_buffer(aptr, bptr, a[0], b[0], 0);
    for (uint64_t i = 0; i < sz; i++) {
        uint64_t curr_buffer = i % 2;
        __m256* curra = a[curr_buffer];
        __m256* currb = b[curr_buffer];
        if (i < sz - 1) {
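            // note: next_buffer equals curr_buffer here, i.e. this is the
            // "load into the used buffer" variant discussed above; it ran
            // as fast as cycling over the two buffers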
            uint64_t next_buffer = (i) % 2;
            __m256* nexta = a[next_buffer];
            __m256* nextb = b[next_buffer];
            uint64_t load_position = (i + 0) * num_lanes * num_loads;
            load_buffer(aptr, bptr, nexta, nextb, load_position);
        }
        c[0] = _mm256_fmadd_ps(curra[0], currb[0], c[0]);
        c[1] = _mm256_fmadd_ps(curra[0], currb[0], c[1]);
        c[2] = _mm256_fmadd_ps(curra[0], currb[0], c[2]);
        c[3] = _mm256_fmadd_ps(curra[0], currb[0], c[3]);
        c[4] = _mm256_fmadd_ps(curra[1], currb[1], c[4]);
        c[5] = _mm256_fmadd_ps(curra[1], currb[1], c[5]);
        c[6] = _mm256_fmadd_ps(curra[1], currb[1], c[6]);
        c[7] = _mm256_fmadd_ps(curra[1], currb[1], c[7]);
    }
    constexpr size_t num_unrolls = 8;
    uint64_t flops{num_unrolls * num_lanes * 2 * sz};
    float res = epilogue(c);
    return {res, flops};
}

int main(int argc, char** argv) {
    constexpr uint64_t sz = {1000000};
    constexpr uint64_t num_loads = 2;
    constexpr uint64_t num_lanes = 256 / (sizeof(float) * 8);  // 8 lanes per __m256
    constexpr uint64_t eles{num_loads * num_lanes * sz};
    std::vector<float> a(eles);
    std::vector<float> b(eles);
    double total_res{0};
    constexpr size_t RUNS{100};
    fmadd_256(a.data(), b.data(), sz, num_loads);  // test call
    uint64_t total_flops{0};
    auto begin = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < RUNS; ++i) {
        auto [res, flops] = fmadd_256(a.data(), b.data(), sz, num_loads);
        total_flops += flops;
        total_res += res;
        std::swap(a, b);
    }
    double total_gflops = 1e-9 * (double)total_flops;
    auto end = std::chrono::high_resolution_clock::now();
    double duration =
        std::chrono::duration_cast<std::chrono::milliseconds>(end - begin)
            .count();
    reporting(duration, total_gflops, total_res);
}

Solution

  • I finally understand what is going on: the data size I used in the example above is larger than my computer's cache. The relationship between the working-set size (KB across the two input vectors) and GFLOPS/sec is shown in the table below (a working-set calculation for the original benchmark parameters follows the table):

    | Benchmark (working set, KB) | GFLOPS/sec |
    |-----------------------------|------------|
    | FMADD/12                    | 127.004    |
    | FMADD/24                    | 123.528    |
    | FMADD/32                    | 124.203    |
    | FMADD/48                    | 125.133    |
    | FMADD/56                    | 125.023    |
    | FMADD/128                   | 109.202    |
    | FMADD/256                   | 91.519     |
    | FMADD/1024                  | 86.1444    |
    | FMADD/2048                  | 85.9394    |
    | FMADD/4096                  | 78.2796    |
    | FMADD/8192                  | 53.0381    |
    | FMADD/16384                 | 39.7985    |
    | FMADD/32768                 | 34.2683    |
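
For reference, here is my own back-of-the-envelope calculation (not from Advisor) of the working set that one call in the full program above streams through, with sz = 1000000 and num_loads = 2:

#include <cstdint>
#include <cstdio>

int main() {
    constexpr uint64_t sz = 1000000, num_loads = 2, num_lanes = 8;
    // fmadd_256 reads sz * num_lanes * num_loads floats from each of the two input vectors
    constexpr uint64_t floats_per_vec = sz * num_lanes * num_loads;          // 16,000,000
    constexpr double kb = 2.0 * floats_per_vec * sizeof(float) / 1024.0;     // ~125,000 KB
    std::printf("working set: %.0f KB (~%.0f MiB)\n", kb, kb / 1024.0);
    return 0;
}

That is well beyond even the largest entry in the table, so the original benchmark was bound by DRAM bandwidth rather than by the FMA units.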