Measure CPU speed by counting assembly instructions

Edit: My original example had a silly mistake. After fixing it I still get weird results, though.

In my naive attempt to measure my CPU speed the "brute-force" way, I made the program below:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#pragma comment(linker, "/entry:mainCRTStartup")
#pragma comment(linker, "/Subsystem:Console")

int mainCRTStartup()
{
    char buf[20];
    clock_t start, elapsed;
    unsigned long count = 0;
    start = clock();
    __asm
    {
        mov EAX, 0;
    _loop:
        add EAX, 3; // accounts for itself and next 2 instructions
        cmp EAX, 0xFFFFFFFF - 0x400;
        jb _loop;
        mov count, EAX;
    }
    elapsed = clock() - start;
    // billions of instructions per second: count / (elapsed seconds * 1e9)
    _gcvt(count * (long long)CLOCKS_PER_SEC / (elapsed * 1000000000.0), 3, buf);
    puts(buf);
    return 0;
}

Which disassembles into something like:

mainCRTStartup:
  push   ebp
  mov    ebp,esp
  sub    esp,28h
  mov    dword ptr [count],0
  call   dword ptr [_clock]
  mov    dword ptr [start],eax
  mov    eax,0

_loop:
  add    eax,03h
  cmp    eax,0FFFFFBFFh
  jb     _loop

  mov    dword ptr [count],eax
  call   dword ptr [_clock]
  sub    eax,dword ptr [start]

  ...    // call _gcvt, _puts, etc.

  mov    esp,ebp
  pop    ebp
  ret

Notice that the loop body is 3 instructions and eax grows by 3 on every iteration, so the final value of eax should equal the total number of loop instructions executed.

Why do I get 4.2 (billion instructions per second) when I run this?

Solution

Because instruction-level parallelism and a superscalar architecture let the CPU execute more than one instruction per clock cycle, so the number of instructions per second is not the same as the clock frequency.

For example, in your code, branch prediction effectively eliminates the cmp instruction for all but the last _loop iteration, by:

1. executing cmp and jb in parallel, and
2. always taking the jb branch.

Of course, (2) fails on the last iteration: the branch is mispredicted, which causes the pipeline to be flushed. The extra ~20 cycles (for a 20-stage pipeline) are negligible, since your loop executes on the order of 10^9 instructions.

"the compiler shouldn't be optimizing this"

The processor hardware is always looking for optimization opportunities in the datapath, independently of the compiler; compilers just try to organize instructions to exploit a given architecture's patterns. E.g., hardware pipelining can increase IPC (instructions per cycle) without software pipelining, especially for relatively hazard-free code such as your example.