performance assembly x86 cpu-architecture cpu-speed

Exactly how "fast" are modern CPUs?


When I used to program embedded systems and early 8/16-bit PCs (6502, 68K, 8086), I had a pretty good handle on exactly how long (in nanoseconds or microseconds) each instruction took to execute. Depending on the family, one (or four) cycles equated to one "memory fetch", and without caches to worry about, you could guess timings based on the number of memory accesses involved.

But with modern CPUs, I'm confused. I know they're a lot faster, but I also know that the headline gigahertz speed isn't helpful without knowing how many cycles of that clock are needed for each instruction.

So, can anyone provide some timings for two sample instructions on (let's say) a 2 GHz Core 2 Duo? Best and worst cases (assuming nothing in cache/everything in cache) would be useful.

Instruction #1: Add one 32-bit register to a second.

Instruction #2: Move a 32-bit value from register to memory.

Edit: The reason I ask this is to try and develop a "rule-of-thumb" that would allow me to look at simple code and roughly gauge the time taken to the nearest order of magnitude.

Edit #2: Lots of answers with interesting points, but nobody (yet) has put down a figure measured in time. I appreciate there are "complications" to the question, but c'mon: If we can estimate the number of piano-tuners in NYC, we should be able to estimate code runtimes...

Take the following (dumb) code:

int32 sum = frigged_value();

// start timing
 for (int i = 0 ; i < 10000; i++)
 {
   for (int j = 0 ; j < 10000; j++)
   {
     sum += (i * j);
   }
   sum = sum / 1000;
 }

// end timing

How can we estimate how long it will take to run... 1 femtosecond? 1 gigayear?


Solution

  • Modern processors such as the Core 2 Duo you mention are both superscalar and pipelined. Superscalar means each core has multiple execution units and works on more than one instruction at a time. Pipelined means there is a latency between when an instruction is read in and "issued" and when it completes execution, and that latency varies with the dependencies between that instruction and the others moving through the other execution units at the same time.

    So, in effect, the timing of any given instruction depends on what is around it and what it depends on; each instruction has a best-case and a worst-case execution time rather than one fixed cost. Because of the multiple execution units, more than one instruction can complete per core clock, but there are sometimes several clocks between completions if the pipeline has to stall waiting for memory or for dependencies in the pipelines.

    All of the above is just from the view of the CPU core itself. Then you have interactions with the caches and contention for bandwidth with the other cores. The Bus Interface Unit of the CPU deals with getting instructions and data fed into the core and putting results back out of the core through the caches to memory.

    Rough order-of-magnitude rules of thumb, to be taken with a grain of salt:

    • Register-to-register operations take 1 core clock to execute. This should generally be conservative, especially as more of these appear in sequence.
    • Memory-related load and store operations take 1 memory bus clock to execute. This should be very conservative. With a high cache hit rate it will be more like 2 CPU bus clocks, where the CPU bus clock is the clock rate of the bus between the CPU core and the cache, which is not necessarily the core's clock.
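
As a rough illustration, the first rule of thumb can be applied to the loop from the question. The sketch below is C. The function names and the figure of four register-to-register operations per inner iteration (multiply, add, increment, compare-and-branch) are assumptions made for illustration, not measurements. The division in the outer loop is ignored because it runs only 10,000 times.

```c
#include <assert.h>
#include <stdint.h>

/* Estimated run time in seconds, assuming every operation costs a fixed
 * number of core clocks. All parameters are inputs to a back-of-envelope
 * model, not measured values. */
double estimate_seconds(uint64_t iterations, double ops_per_iteration,
                        double clocks_per_op, double clock_hz)
{
    assert(clock_hz > 0.0);
    double total_clocks = (double)iterations * ops_per_iteration * clocks_per_op;
    return total_clocks / clock_hz;
}

/* The question's loop nest on the question's 2 GHz CPU. */
double estimate_question_loop(void)
{
    const uint64_t inner_iterations = 10000ULL * 10000ULL; /* 10^8 */
    const double ops_per_iteration = 4.0; /* assumption: mul, add, inc, cmp+branch */
    const double clocks_per_op = 1.0;     /* rule of thumb: 1 core clock per reg-reg op */
    const double clock_hz = 2.0e9;        /* 2 GHz, from the question */
    return estimate_seconds(inner_iterations, ops_per_iteration,
                            clocks_per_op, clock_hz);
}
```

With these assumptions the estimate is 10^8 × 4 × 1 / (2 × 10^9) = 0.2 seconds, so the answer is on the order of tenths of a second rather than femtoseconds or gigayears. A compiler may well simplify the loop, so this says nothing about what an optimized build would actually do.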