Tags: c++, performance, x86, cpu, simd

Avoid Frequency Scaling for SIMD FMA Performance


The following program shows highly variable performance when run for different numbers of iterations. What could be the reason, and how can I get consistent measurements?

The program measures the peak FLOPS a single CPU core attains. It is:

#include <immintrin.h>
#include <smmintrin.h>

#include <chrono>
#include <cstddef>
#include <iostream>
#include <utility>  // std::pair, std::swap

float epilogue(__m256 c[8]);  // forward declaration; defined below

std::pair<float, float> fmadd_256(__m256 a[8], __m256 b[8], size_t sz) {
    __m256 c[8];
    for (std::size_t i = 0; i < 8; ++i) {
        c[i] = _mm256_setzero_ps();  // accumulators must start at zero
    }
    float total_gflops{0};
    for (std::size_t i = 0; i < sz; i++) {
        c[0] = _mm256_fmadd_ps(a[0], b[0], c[0]);
        c[1] = _mm256_fmadd_ps(a[1], b[1], c[1]);
        c[2] = _mm256_fmadd_ps(a[2], b[2], c[2]);
        c[3] = _mm256_fmadd_ps(a[3], b[3], c[3]);
        c[4] = _mm256_fmadd_ps(a[4], b[4], c[4]);
        c[5] = _mm256_fmadd_ps(a[5], b[5], c[5]);
        c[6] = _mm256_fmadd_ps(a[6], b[6], c[6]);
        c[7] = _mm256_fmadd_ps(a[7], b[7], c[7]);

        total_gflops += 8 * 8 * 2;  // 8 FMAs x 8 lanes x 2 flops each = 128
    }
    float res = epilogue(c);
    return {res, total_gflops};
}

float epilogue(__m256 c[8]) {
    c[0] = _mm256_add_ps(c[0], c[1]);
    c[2] = _mm256_add_ps(c[2], c[3]);

    c[4] = _mm256_add_ps(c[4], c[5]);
    c[6] = _mm256_add_ps(c[6], c[7]);

    c[0] = _mm256_add_ps(c[0], c[2]);
    c[4] = _mm256_add_ps(c[4], c[6]);

    c[0] = _mm256_add_ps(c[0], c[4]);
    float res{0.0};
    for (size_t i = 0; i < 8; ++i) {
        res += c[0][i];
    }
    return res;
}

template <typename T>
void reporting(T duration, float flops, float res) {
    // `duration` is in microseconds and `flops` is already in GFLOPS,
    // so multiplying by 1e6 converts to GFLOPS per second.
    float gflops_sec = (flops * 1000.0 * 1000) / duration;
    std::cout << "The inner product is: " << res << std::endl;
    std::cout << "GFLOPS/sec: " << gflops_sec << std::endl;
    std::cout << "total gflops: " << flops << std::endl;
    std::cout << "Duration: " << duration << std::endl;
}


int main(int argc, char** argv) {
    __m256 a[8];
    __m256 b[8];
    for (size_t i = 0; i < 8; ++i) {
        // Initialize inputs; reading uninitialized __m256 values is UB.
        a[i] = _mm256_set1_ps(1.0f);
        b[i] = _mm256_set1_ps(0.5f);
    }
    double total_res{0};
    constexpr size_t iters{5 * 10000000 / 1};  // varied to produce the table below
    constexpr size_t RUNS{100};
    fmadd_256(a, b, iters);  // test call
    double total_gflops{0};

    auto begin = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < RUNS; ++i) {
        auto [res, gflops] = fmadd_256(a, b, iters);
        total_gflops += gflops;
        total_res += res;
        std::swap(a, b);
    }
    total_gflops *= 1e-9;
    auto end = std::chrono::high_resolution_clock::now();
    double duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - begin)
            .count();

    reporting(duration, total_gflops, total_res);
}

The fmadd_256 function computes the FMAs, and the main function drives it. There's also an epilogue that sums the values from the accumulators and a reporting function that prints the GFLOPS/sec to stdout. The runtime depends on the precise value of iters: for small values, the GFLOPS/sec are unstable (which makes sense); then the GFLOPS/sec increase and plateau around ~120-130. Beyond a certain point, they are still consistent but fall. Here's a small summary table:

Iterations    GFLOPS/sec
1*10^6        120
1*10^7        120
2*10^7        100
3*10^7         66
5*10^7         40

My questions are:

  • Why is the measured performance (beyond small iteration counts) so different?
  • Can I sample the clock frequency at a high enough rate to verify that frequency scaling is the root cause?
  • Furthermore, is there a way to prevent the volatility (even if the result is below peak performance)?

I pinned the program to a particular CPU with taskset and watched the frequency reported in /proc/cpuinfo at 0.1 s intervals. The frequency was slightly volatile, but the sampling interval was too coarse to capture significant frequency changes. I also tried disabling frequency scaling with cpupower frequency-set --governor performance. That command exited without error, but /proc/cpuinfo still showed variations, and the program's performance didn't become consistent either.
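
For what it's worth, here is a minimal sketch of how one might sample the frequency at a finer interval than 0.1 s; it assumes a Linux cpufreq sysfs interface and that the benchmark is pinned to cpu0 (adjust the path for a different core):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    // Current frequency in kHz as reported by the cpufreq driver for core 0.
    const char* path = "/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq";
    for (int i = 0; i < 1000; ++i) {  // roughly one second of samples
        std::ifstream f(path);       // reopen each time to get a fresh value
        std::string khz;
        if (f >> khz) {
            std::cout << khz << '\n';
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
}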

I am running the benchmark on an Alder Lake laptop.


Solution

  • Your test program has multiple bugs. On my computer, your code does not even compile; MSVC reports:

    error C2676: binary '[': '__m256' does not define this operator or a conversion to a type acceptable to the predefined operator

    Fixed with the following function, which sums all lanes of the vector:

    inline float hadd( __m256 v8 )
    {
        // Add the upper 128-bit half of the vector to the lower half
        __m128 v = _mm256_extractf128_ps( v8, 1 );
        v = _mm_add_ps( v, _mm256_castps256_ps128( v8 ) );
        // Add the upper two lanes to the lower two
        v = _mm_add_ps( v, _mm_movehl_ps( v, v ) );
        // Add lane 1 to lane 0
        v = _mm_add_ss( v, _mm_movehdup_ps( v ) );
        return _mm_cvtss_f32( v );
    }
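
    With that helper, the epilogue no longer needs GNU-style per-lane indexing. A portable version might look like this (a sketch using the same pairwise reduction as in the question, finished with one hadd call):

    float epilogue( __m256 c[8] )
    {
        // Pairwise tree reduction of the 8 accumulators into c[0]
        c[0] = _mm256_add_ps( c[0], c[1] );
        c[2] = _mm256_add_ps( c[2], c[3] );
        c[4] = _mm256_add_ps( c[4], c[5] );
        c[6] = _mm256_add_ps( c[6], c[7] );
        c[0] = _mm256_add_ps( c[0], c[2] );
        c[4] = _mm256_add_ps( c[4], c[6] );
        c[0] = _mm256_add_ps( c[0], c[4] );
        return hadd( c[0] );  // horizontal sum of the final vector
    }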
    

    Anyway, the main issue is this line:

    total_gflops += 8 * 8 * 2;

    Your total_gflops counter only has FP32 precision, and a float carries a 24-bit significand. Once the accumulator reaches 128 * 2^24 ≈ 2.1*10^9, adding another 128 rounds away to nothing, so you can't increment it 2*10^7 times by +128: the flop count saturates while the runtime keeps growing, and the reported GFLOPS/sec drops.
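
    A few standalone lines are enough to see the saturation (a hypothetical demo, not part of the benchmark):

    #include <iostream>

    int main()
    {
        float f = 0.0f;
        for ( long i = 0; i < 50000000; i++ )
            f += 128.0f;  // same increment as total_gflops in the question
        // Prints 2.14748e+09 (= 128 * 2^24), not the true sum 6.4e+09:
        // beyond that value, f + 128.0f rounds back to f.
        std::cout << f << "\n";
    }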

    With 1E6 iterations your code prints 146 GFlops on my computer. With 2E7 iterations it prints 123 GFlops, with 5E7 it prints 48.5 GFlops.

    However, once I replaced the float total_gflops{0}; with double total_gflops = 0; it went back to normal, reporting 147 GFlops for 5E7 iterations. That's because FP64 numbers have a 53-bit significand, more than enough precision to increment a value 5E7 times by 128.
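
    Since every iteration contributes exactly 8 * 8 * 2 = 128 flops, an even simpler option (a sketch, not the only possible fix) is to drop the per-iteration accumulation entirely and compute the count once after the loop:

    // Replace `total_gflops += 8 * 8 * 2;` in the loop with a single
    // computation after it (and widen the pair's second member to double):
    double total_gflops = static_cast<double>( sz ) * 8 * 8 * 2;
    return { res, total_gflops };  // now std::pair<float, double>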