Edit: My original example had a silly mistake. After fixing it I still get weird results, though.
In my naive attempt to measure my CPU speed the "brute-force" way, I made the program below:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#pragma comment(linker, "/entry:mainCRTStartup")
#pragma comment(linker, "/Subsystem:Console")

int mainCRTStartup()
{
    char buf[20];
    clock_t start, elapsed;
    unsigned long count = 0;

    start = clock();
    __asm
    {
        mov EAX, 0;
    _loop:
        add EAX, 3; // accounts for itself and next 2 instructions
        cmp EAX, 0xFFFFFFFF - 0x400;
        jb _loop;
        mov count, EAX;
    }
    elapsed = clock() - start;

    _gcvt(count * (long long)CLOCKS_PER_SEC / (elapsed * 1000000000.0), 3, buf);
    puts(buf);
}
Which disassembles into something like:
mainCRTStartup:
    push ebp
    mov ebp,esp
    sub esp,28h
    mov dword ptr [count],0
    call dword ptr [_clock]
    mov dword ptr [start],eax
    mov eax,0
_loop:
    add eax,03h
    cmp eax,0FFFFFBFFh
    jb _loop
    mov dword ptr [count],eax
    call dword ptr [_clock]
    sub eax,dword ptr [start]
    ... // call _gcvt, _puts, etc.
    mov esp,ebp
    pop ebp
    ret
Notice that the loop is 3 instructions, so the final value of eax should be the total number of instructions executed.
Why do I get 4.2 when I run this?
Because instruction-level parallelism and superscalar architecture allow multiple instructions to execute in a single pipelined clock cycle.
For example, in your code, branch prediction effectively eliminates the cmp instruction for all but the last _loop iteration, by:
1. executing cmp and jb in parallel, and
2. speculatively taking the jb branch.
Of course, (2) is thrown out on the last iteration, which causes the pipeline to be cleared. The extra ~20 cycles (for a 20-stage pipeline) are negligible since your loop is on the order of 10^9 instructions.
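To put a number on "negligible", here is a rough back-of-the-envelope sketch in portable C. The helper name flush_fraction is hypothetical, and the one-cycle-per-iteration figure used below is an assumption chosen to make the flush look as expensive as possible, not a measurement.

```c
#include <assert.h>

/* Hypothetical helper: fraction of all cycles spent on a single pipeline
 * flush, given the number of loop iterations, an assumed cost per
 * iteration in cycles, and an assumed flush penalty in cycles. */
double flush_fraction(double iterations, double cycles_per_iteration,
                      double flush_cycles)
{
    assert(iterations > 0.0 && cycles_per_iteration > 0.0);
    double loop_cycles = iterations * cycles_per_iteration;
    return flush_cycles / (loop_cycles + flush_cycles);
}
```

Your loop adds 3 to EAX until it reaches 0xFFFFFBFF, which is about 1.43 * 10^9 iterations. Calling flush_fraction(1431655424.0, 1.0, 20.0) gives a value on the order of 10^-8, far below anything clock() can resolve.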
You wrote: "the compiler shouldn't be optimizing this"
The processor hardware is always looking for optimization opportunities in the datapath; compilers just try to organize instructions to exploit a given architecture's patterns. E.g., hardware pipelining can increase IPC without software pipelining, especially for relatively hazard-free code such as your example.
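As a sketch of what the printed number means: your program reports billions of instructions per second, and that figure equals the clock frequency (in GHz) times the average instructions per cycle (IPC). The helper names below are hypothetical, and the 2.1 GHz clock in the usage note is an assumed value for illustration only, since your actual clock speed is not stated.

```c
#include <assert.h>

/* Billions of instructions per second: the same quantity the question's
 * program computes from the instruction count and clock() ticks. */
double giga_instr_per_sec(double instructions, double elapsed_ticks,
                          double ticks_per_sec)
{
    assert(elapsed_ticks > 0.0 && ticks_per_sec > 0.0);
    return instructions * ticks_per_sec / (elapsed_ticks * 1e9);
}

/* Average instructions per cycle implied by that rate at a given clock
 * frequency in GHz (the frequency is an input you must supply). */
double implied_ipc(double giga_ips, double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return giga_ips / clock_ghz;
}
```

For instance, implied_ipc(4.2, 2.1) is about 2, meaning roughly two instructions complete per cycle on average. An IPC above 1 is exactly what a superscalar core is designed to deliver, so a result higher than the clock frequency is not a measurement error.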