I wrote a simple program that executes a bunch of NOP instructions in a loop, and to my surprise it executes about 10,600,000,000 of them per second, a rate of roughly 10.6 GHz, while my CPU is only 2.2 GHz.
How is this possible? Is the CPU treating them as a single mega-NOP, or did I just discover what "instruction-level parallelism" means?
What would be a better measure of instructions per second? Doing ADD instructions reaches only 414,900,000/s, about a tenth of the BogoMIPS value reported for my CPU: 4390.03.
C code:
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#define ten(a) a a a a a a a a a a
#define hundred(a) ten(a) ten(a) ten(a) ten(a) ten(a) ten(a) ten(a) \
ten(a) ten(a) ten(a)
#define ITER 10000000
int main(void) {
    uint64_t i = 0;
    uint64_t t = time(NULL);
    while (1) {
        for (int j = 0; j < ITER; j++) {
            hundred(asm volatile ("nop");)
        }
        i += ITER * 100;
        printf("%lu/%lu\n", i, time(NULL) - t);
    }
    return 0;
}
Compiled assembly:
.file "gbloopinc.c"
.section .rodata
.LC0:
.string "%lu/%lu\n"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $32, %rsp
movq $0, -16(%rbp)
movl $0, %edi
call time
movq %rax, -8(%rbp)
.L4:
movl $0, -20(%rbp)
jmp .L2
.L3:
#APP
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
# 15 "gbloopinc.c" 1
nop
# 0 "" 2
#NO_APP
addl $1, -20(%rbp)
.L2:
cmpl $9999999, -20(%rbp)
jle .L3
addq $1000000000, -16(%rbp)
movl $0, %edi
call time
subq -8(%rbp), %rax
movq %rax, %rdx
movq -16(%rbp), %rax
movq %rax, %rsi
movl $.LC0, %edi
movl $0, %eax
call printf
jmp .L4
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 5.4.0-6ubuntu1~16.04.2) 5.4.0 20160609"
.section .note.GNU-stack,"",@progbits
This has nothing to do with multiple cores. Cores are not "ports".
4 NOPs per clock is the issue/retirement pipeline width of your superscalar / out-of-order CPU. NOPs don't even need an execution unit / execution port (ALU or load or store), so you're not even limited by the number of integer execution units. Even Core2 (Intel's first 4-wide x86 CPU) could run 4 NOPs per clock.
As you guessed, this is an example of Instruction-level Parallelism. NOPs of course have no input dependencies.
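If you want to check this yourself, counting cycles instead of wall-clock seconds is more direct. Here is a minimal sketch (mine, not from the question) that times an unrolled block of NOPs against the TSC; it assumes x86-64 with GCC or Clang and a constant/invariant TSC:
/* Sketch: NOPs per TSC tick.  Assumes x86-64, GCC/Clang, and a
 * constant/invariant TSC.  Caveat: the TSC runs at the reference
 * frequency, not the current core clock, so under turbo this
 * overstates NOPs per actual core cycle. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>              /* __rdtsc() */

int main(void) {
    const uint64_t reps = 10000000;
    uint64_t start = __rdtsc();
    for (uint64_t j = 0; j < reps; j++) {
        /* 100 NOPs per iteration; .rept is a GNU assembler directive */
        __asm__ volatile (".rept 100\n\tnop\n\t.endr");
    }
    uint64_t ticks = __rdtsc() - start;
    printf("%.2f NOPs per TSC tick\n", (double)(reps * 100) / ticks);
    return 0;
}
On Linux, perf stat ./a.out sidesteps the TSC caveat entirely: it reports retired instructions and actual core cycles, and the instructions-per-cycle ratio it prints is the number you're really after.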
On your Sandybridge CPU (with 3 ALU execution units per core), you could run 3 ADDs and one load or store instruction per clock, since its pipeline width is 4 uops. See Agner Fog's microarch pdf and other links in the x86 tag wiki. On a stream of independent ADD instructions, like
add eax, eax
add ebx, ebx
add ecx, ecx
add edx, edx
...
you'd see about 3 per clock throughput on SnB, bottlenecking on integer ALU execution ports. Haswell could run this at 4 ADDs per clock, because it has a 4th ALU execution port that can handle non-vector integer ops (and branches).
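You can see that port bottleneck in a measurable form. Below is a sketch of mine (not from the answer): an unrolled block of adds spread across four registers, so there are four independent dependency chains and the only limit is how many ALU ports can accept an add each cycle:
/* Sketch: four independent add chains, 100 adds per loop iteration.
 * Expect ~3 adds/cycle on Sandybridge (3 integer ALU ports) and
 * ~4/cycle on Haswell and later.  Assumes x86-64 GCC/Clang; measure
 * with perf stat to get a real instructions-per-cycle number. */
#include <stdint.h>

void independent_adds(uint64_t reps) {
    for (uint64_t j = 0; j < reps; j++) {
        __asm__ volatile (
            ".rept 25\n\t"          /* 25 * 4 = 100 adds per iteration */
            "add $1, %%eax\n\t"     /* four separate registers: no add */
            "add $1, %%ebx\n\t"     /* depends on the previous one, so */
            "add $1, %%ecx\n\t"     /* all four chains run in parallel */
            "add $1, %%edx\n\t"
            ".endr"
            ::: "eax", "ebx", "ecx", "edx", "cc");
    }
}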
Out-of-order CPUs typically have a wider front-end and issue/retire width than their number of execution units. Having more instructions already decoded and ready to run as soon as an execution unit frees up increases utilization of those units; otherwise the out-of-order machinery could only see ahead of what's currently executing when execution stalled or slowed down due to serial dependencies. (e.g. add eax,eax / add eax,eax: the second add needs the output of the first as its input, so the pair can only run at one instruction per clock.)
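For contrast, here is the serial version of the same sketch: everything funnels through one register, so each add has to wait for the previous one's result, and throughput is pinned at the 1-cycle add latency no matter how wide the pipeline is:
/* Sketch: one serial dependency chain.  Each add reads the previous
 * add's result, so this runs at ~1 add/cycle on any of these CPUs,
 * regardless of pipeline width.  Same assumptions as above. */
#include <stdint.h>

void serial_adds(uint64_t reps) {
    for (uint64_t j = 0; j < reps; j++) {
        __asm__ volatile (
            ".rept 100\n\t"
            "add $1, %%eax\n\t"     /* 100-deep chain through eax */
            ".endr"
            ::: "eax", "cc");
    }
}
Running both versions under perf stat should show roughly a 3x to 4x gap in instructions per cycle between them.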