To measure the peak FLOPS performance of a CPU I wrote a little C++ program. But the measurements give me results larger than the theoretical peak FLOPS of my CPU. What is wrong?
This is the code I wrote:
#include <iostream>
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_mul_ps, _mm_add_ps, ...
#include <cmath>
#include <chrono>
//28FLOP
inline void _Mandelbrot(__m128 & A_Re, __m128 & A_Im, const __m128 & B_Re, const __m128 & B_Im, const __m128 & c_Re, const __m128 & c_Im)
{
A_Re = _mm_add_ps(_mm_sub_ps(_mm_mul_ps(B_Re, B_Re), _mm_mul_ps(B_Im, B_Im)), c_Re); //16FLOP
A_Im = _mm_add_ps(_mm_mul_ps(_mm_set_ps1(2.0f), _mm_mul_ps(B_Re, B_Im)), c_Im); //12FLOP
}
float Mandelbrot()
{
std::chrono::high_resolution_clock::time_point startTime, endTime;
float phi = 0.0f;
const float dphi = 0.001f;
__m128 res, c_Re, c_Im,
x1_Re, x1_Im,
x2_Re, x2_Im,
x3_Re, x3_Im,
x4_Re, x4_Im,
x5_Re, x5_Im,
x6_Re, x6_Im;
res = _mm_setzero_ps();
startTime = std::chrono::high_resolution_clock::now();
//168GFLOP
for (int i = 0; i < 1000; ++i)
{
c_Re = _mm_setr_ps( -1.0f + 0.1f * std::sinf(phi + 0 * dphi), //20FLOP
-1.0f + 0.1f * std::sinf(phi + 1 * dphi),
-1.0f + 0.1f * std::sinf(phi + 2 * dphi),
-1.0f + 0.1f * std::sinf(phi + 3 * dphi));
c_Im = _mm_setr_ps( 0.0f + 0.1f * std::cosf(phi + 0 * dphi), //20FLOP
0.0f + 0.1f * std::cosf(phi + 1 * dphi),
0.0f + 0.1f * std::cosf(phi + 2 * dphi),
0.0f + 0.1f * std::cosf(phi + 3 * dphi));
x1_Re = _mm_set_ps1(-0.00f * dphi); x1_Im = _mm_setzero_ps(); //1FLOP
x2_Re = _mm_set_ps1(-0.01f * dphi); x2_Im = _mm_setzero_ps(); //1FLOP
x3_Re = _mm_set_ps1(-0.02f * dphi); x3_Im = _mm_setzero_ps(); //1FLOP
x4_Re = _mm_set_ps1(-0.03f * dphi); x4_Im = _mm_setzero_ps(); //1FLOP
x5_Re = _mm_set_ps1(-0.04f * dphi); x5_Im = _mm_setzero_ps(); //1FLOP
x6_Re = _mm_set_ps1(-0.05f * dphi); x6_Im = _mm_setzero_ps(); //1FLOP
//168MFLOP
for (int j = 0; j < 1000000; ++j)
{
_Mandelbrot(x6_Re, x6_Im, x1_Re, x1_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x1_Re, x1_Im, x2_Re, x2_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x2_Re, x2_Im, x3_Re, x3_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x3_Re, x3_Im, x4_Re, x4_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x4_Re, x4_Im, x5_Re, x5_Im, c_Re, c_Im); //28FLOP
_Mandelbrot(x5_Re, x5_Im, x6_Re, x6_Im, c_Re, c_Im); //28FLOP
}
res = _mm_add_ps(res, x1_Re); //4FLOP
phi += 4.0f * dphi; //2FLOP
}
endTime = std::chrono::high_resolution_clock::now();
if (res.m128_f32[0] + res.m128_f32[1] > res.m128_f32[2] + res.m128_f32[3]) //Prevent dead code removal
return 168.0f / (static_cast<float>(std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()) / 1000.0f);
else
return 168.1f / (static_cast<float>(std::chrono::duration_cast<std::chrono::milliseconds>(endTime - startTime).count()) / 1000.0f);
}
int main()
{
std::cout << Mandelbrot() << "GFLOP/s" << std::endl;
return 0;
}
The core function _Mandelbrot performs 4 * _mm_mul_ps + 2 * _mm_add_ps + 1 * _mm_sub_ps, each operation working on 4 floats at once, thus 7 * 4 FLOP = 28 FLOP.
The CPU I ran this on is an Intel Core2Quad Q9450 with 2.66GHz. I compiled the code with Visual Studio 2012 under Windows 7. The theoretical peak FLOPS should be 4 * 2.66GHz = 10.64GFLOPS. But the program returns 18.4GFLOPS and I can't find out what's wrong. Can someone show me?
According to the Intel® Intrinsics Guide, _mm_mul_ps, _mm_add_ps and _mm_sub_ps all have Throughput=1 for your CPUID 06_17 (as you noted).
Different sources define throughput differently: in some places it is cycles/instruction, in others the inverse (of course, while the value is 1 it does not matter).
According to the "Intel® 64 and IA-32 Architectures Optimization Reference Manual", the definition of Throughput is:
Throughput — The number of clock cycles required to wait before the issue ports are free to accept the same instruction again. For many instructions, the throughput of an instruction can be significantly less than its latency.
According to "C.3.2 Table Footnotes":
— The FP_ADD unit handles x87 and SIMD floating-point add and subtract operations.
— The FP_MUL unit handles x87 and SIMD floating-point multiply operations.
I.e. additions/subtractions and multiplications are executed on different execution units. The FP_ADD and FP_MUL units are connected to different dispatch ports (see the pipeline block diagram in the Optimization Reference Manual), and the scheduler can dispatch instructions to several ports every cycle.
Multiplication and addition execution units can perform operations in parallel. So theoretical GFLOPS on one core of your processor is:
sse_packet_size = 4
instructions_per_cycle = 2
clock_rate_ghz = 2.66
sse_packet_size * instructions_per_cycle * clock_rate_ghz = 21.28GFLOPS
So, you are closely approaching the theoretical peak with your 18.4GFLOPS.
The _Mandelbrot function has 3 instructions for FP_ADD and 4 for FP_MUL. As you can see, within the function there are many data dependencies, so its instructions cannot be interleaved efficiently: in order to feed FP_ADD with an operation, FP_MUL first has to execute at least two operations to produce the operands FP_ADD requires.
But fortunately, your inner for loop has many operations without dependencies:
for (int j = 0; j < 1000000; ++j)
{
_Mandelbrot(x6_Re, x6_Im, x1_Re, x1_Im, c_Re, c_Im); // 1
_Mandelbrot(x1_Re, x1_Im, x2_Re, x2_Im, c_Re, c_Im); // 2
_Mandelbrot(x2_Re, x2_Im, x3_Re, x3_Im, c_Re, c_Im); // 3
_Mandelbrot(x3_Re, x3_Im, x4_Re, x4_Im, c_Re, c_Im); // 4
_Mandelbrot(x4_Re, x4_Im, x5_Re, x5_Im, c_Re, c_Im); // 5
_Mandelbrot(x5_Re, x5_Im, x6_Re, x6_Im, c_Re, c_Im); // 6
}
Only the sixth call depends on the output of the first. The instructions of all other calls can be interleaved freely with each other (by both the compiler and the processor), which keeps both the FP_ADD and FP_MUL units busy.
P.S. Just as a test, you can try replacing all add/sub operations with mul in the _Mandelbrot function, or vice versa: you will get only about half of the current FLOPS.