To measure the peak FLOPS performance of a CPU I wrote a little C++ program. But the measurements give me results larger than the theoretical peak FLOPS of my CPU. What is wrong?
This is the code I wrote:
#include <iostream>
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps, ...
#include <cmath>
#include <chrono>
//28FLOP
inline void _Mandelbrot(__m128 & A_Re, __m128 & A_Im, const __m128 & B_Re, const __m128 & B_Im, const __m128 & c_Re, const __m128 & c_Im)
{
A_Re = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(B_Re, B_Re), _mm_mul_ps(B_Im, B_Im)), c_Re); //16FLOP
A_Im = _mm_add_ps(_mm_mul_ps(_mm_set_ps1(2.0f), _mm_mul_ps(B_Re, B_Im)), c_Im); //12FLOP
}
float Mandelbrot()
{
std::chrono::high_resolution_clock::time_point startTime, endTime;
float phi = 0.0f;
const float dphi = 0.001f;
__m128 res, c_Re, c_Im,
x1_Re, x1_Im,
x2_Re, x2_Im,
x3_Re, x3_Im,
x4_Re, x4_Im,
x5_Re, x5_Im,
x6_Re, x6_Im;
res = _mm_setzero_ps();
startTime = std::chrono::high_resolution_clock::now();
//168GFLOP
for (int i = 0; i < 1000; ++i)
{
c_Re = _mm_setr_ps( -1.0f + 0.1f * std::sinf(phi + 0 * dphi), //20FLOP
-1.0f + 0.1f * std::sinf(phi + 1 * dphi),
-1.0f + 0.1f * std::sinf(phi + 2 * dphi),
-1.0f + 0.1f * std::sinf(phi + 3 * dphi));
c_Im = _mm_setr_ps( 0.0f + 0.1f * std::cosf(phi + 0 * dphi), //20FLOP
0.0f + 0.1f * std::cosf(phi + 1 * dphi),
0.0f + 0.1f * std::cosf(phi + 2 * dphi),
0.0f + 0.1f * std::cosf(phi + 3 * dphi));
x1_Re = _mm_set_ps1(-0.00f * dphi); x1_Im = _mm_setzero_ps(); //1FLOP
x2_Re = _mm_set_ps1(-0.01f * dphi); x2_Im = _mm_setzero_ps(); //1FLOP
x3_Re = _mm_set_ps1(-0.02f * dphi); x3_Im = _mm_setzero_ps(); //1FLOP
x4_Re = _mm_set_ps1(-0.03f * dphi); x4_Im = _mm_setzero_ps(); //1FLOP
x5_Re = _mm_set_ps1(-0.04f * dphi); x5_Im = _mm_setzero_ps(); //1FLOP
x6_Re = _mm_set_ps1(-0.05f * dphi); x6_Im = _mm_setzero_ps(); //1FLOP
//168MFLOP
for (int j = 0; j < 1000000; ++j)
{
_Mandelbrot(x6_Re, x6_Im, x1_Re, x1_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x1_Re, x1_Im, x2_Re, x2_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x2_Re, x2_Im, x3_Re, x3_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x3_Re, x3_Im, x4_Re, x4_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x4_Re, x4_Im, x5_Re, x5_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x5_Re, x5_Im, x6_Re, x6_Im, c_Re, c_Im); //28FLOP
}
res = _mm_add_ps(res, x1_Re); //4FLOP
phi += 4.0f * dphi; //2FLOP
}
endTime = std::chrono::high_resolution_clock::now();
if (res.m128_f32[0] + res.m128_f32[1] > res.m128_f32[2] + res.m128_f32[3]) //Prevent dead code removal
return 168.0f / (static_cast<float>(std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()) / 1000.0f);
else
return 168.1f / (static_cast<float>(std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()) / 1000.0f);
}
int main()
{
std::cout << Mandelbrot() << "GFLOP/s" << std::endl;
return 0;
}
The core function _Mandelbrot performs 4 * _mm_mul_ps + 2 * _mm_add_ps + 1 * _mm_sub_ps, each operation working on 4 floats at once, thus 7 * 4 FLOP = 28 FLOP.
The CPU I ran this on is an Intel Core2Quad Q9450 with 2.66GHz. I compiled the code with Visual Studio 2012 under Windows 7. The theoretical peak FLOPS should be 4 * 2.66GHz = 10.64GFLOPS. But the program returns 18.4GFLOPS and I can't find out what's wrong. Can someone show me?
According to the Intel® Intrinsics Guide, _mm_mul_ps, _mm_add_ps and _mm_sub_ps all have Throughput=1 for your CPUID 06_17 (as you noted).
Different sources define throughput differently: in some places it is cycles/instruction, in others the inverse (of course, while the value is 1 it does not matter).
According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", the definition of Throughput is:
Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.
According to "C.3.2 Table Footnotes":
— The FP_ADD unit handles x87 and SIMD floating-point add and subtract operations.
— The FP_MUL unit handles x87 and SIMD floating-point multiply operations.
I.e. additions/subtractions and multiplications are executed on different execution units. The FP_ADD and FP_MUL units are connected to different dispatch ports (see the pipeline block diagram in the Optimization Reference Manual), and the scheduler can dispatch instructions to several ports every cycle.
Multiplication and addition execution units can perform operations in parallel. So theoretical GFLOPS on one core of your processor is:
sse_packet_size = 4
instructions_per_cycle = 2
clock_rate_ghz = 2.66
sse_packet_size * instructions_per_cycle * clock_rate_ghz = 21.28GFLOPS
So, you are closely approaching the theoretical peak with your 18.4GFLOPS.
The _Mandelbrot function has 3 instructions for FP_ADD and 4 for FP_MUL. As you can see, within the function there are many data dependencies, so its instructions cannot be interleaved efficiently: in order to feed FP_ADD with an operation, FP_MUL first has to execute at least two operations to produce the operands FP_ADD requires.
But fortunately, your inner for loop has many operations without dependencies:
for (int j = 0; j < 1000000; ++j)
{
_Mandelbrot(x6_Re, x6_Im, x1_Re, x1_Im, c_Re, c_Im); // 1
_Mandelbrot(x1_Re, x1_Im, x2_Re, x2_Im, c_Re, c_Im); // 2
_Mandelbrot(x2_Re, x2_Im, x3_Re, x3_Im, c_Re, c_Im); // 3
_Mandelbrot(x3_Re, x3_Im, x4_Re, x4_Im, c_Re, c_Im); // 4
_Mandelbrot(x4_Re, x4_Im, x5_Re, x5_Im, c_Re, c_Im); // 5
_Mandelbrot(x5_Re, x5_Im, x6_Re, x6_Im, c_Re, c_Im); // 6
}
Only the sixth call depends on the output of the first. The instructions of all other calls can be interleaved freely with each other (by both the compiler and the processor), which keeps both the FP_ADD and FP_MUL units busy.
P.S. Just as a test, you can try replacing all add/sub operations with mul in the _Mandelbrot function, or vice versa: you will get only about half of the current FLOPS.