Tags: performance, parallel-processing, cpu-architecture

Do single-threaded programs execute in parallel in a CPU?


While measuring the CPI (cycles per instruction) of an Intel 4th Gen i5 we got a CPI < 1.

Some of my classmates suggested that it was due to parallel execution of the code, but it was single-threaded C code; the teacher said that nowadays processors are superscalar.

The code was compiled with gcc -m32.

Assume the compiler is not doing magic by parallelizing the code.
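
The measured source isn't shown here, so as a point of reference, a minimal stand-in for that kind of measurement (assuming Linux perf is available; the file name and loop are purely illustrative) could look like this. perf stat reports "insn per cycle" (IPC), and CPI is just its reciprocal:

    /* cpi_measure.c - illustrative stand-in, not the actual measured program.
     *
     * Build and measure:
     *   gcc -m32 -O2 cpi_measure.c -o cpi_measure
     *   perf stat ./cpi_measure
     */
    #include <stdio.h>

    #define N 4096u

    static volatile unsigned data[N];   /* volatile: keep the loads in the loop */

    int main(void) {
        unsigned sum = 0;
        for (int rep = 0; rep < 100000; rep++)
            for (unsigned i = 0; i < N; i++)
                sum += data[i];          /* simple single-threaded, cache-resident loop */
        printf("%u\n", sum);
        /* On a Haswell-class core, perf typically reports well over 1
         * instruction per cycle for a loop like this, i.e. CPI < 1. */
        return 0;
    }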

But my doubts still remain. Since processors nowadays do some magic with the code, such as out-of-order execution and speculative execution, I wonder:

  • Do processors run single-threaded programs on multiple cores?

Let's say we have these two instructions:

(1) addl %eax, (%eax)
(2) addl %ebx, (%ebx)

Could Core-0 run (1) while Core-1 runs (2)?


Solution

  • Yes, CPUs find instruction-level parallelism within a single thread to run more than 1 instruction per cycle. See Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop? for a specific example.
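
    Here is a self-contained sketch of the same idea (illustrative code, not the linked example): the same number of additions, done once as a single dependency chain and once split across four independent accumulators. Both runs are one thread on one core, but the four-accumulator version typically finishes in noticeably fewer cycles, i.e. higher IPC / lower CPI, because the out-of-order core overlaps the independent adds.

        /* ilp_sketch.c (illustrative):
         *   gcc -m32 -O2 ilp_sketch.c -o ilp_sketch
         *   perf stat ./ilp_sketch 1      # one dependency chain
         *   perf stat ./ilp_sketch 4      # four independent chains
         */
        #include <stdio.h>
        #include <stdlib.h>

        #define TOTAL 400000000u

        int main(int argc, char **argv) {
            int chains = (argc > 1) ? atoi(argv[1]) : 1;
            unsigned a = 0, b = 0, c = 0, d = 0;

            if (chains == 4) {
                for (unsigned i = 0; i < TOTAL / 4; i++) {
                    a += 1; b += 1; c += 1; d += 1;   /* independent: the core can overlap these */
                    /* empty asm: stop gcc from folding the loop to a constant */
                    __asm__ volatile("" : "+r"(a), "+r"(b), "+r"(c), "+r"(d));
                }
            } else {
                for (unsigned i = 0; i < TOTAL; i++) {
                    a += 1;                           /* each add waits for the previous one */
                    __asm__ volatile("" : "+r"(a));
                }
            }
            printf("%u\n", a + b + c + d);
            return 0;
        }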

    Instruction-level parallelism is unrelated to thread-level parallelism (which you bring up in the 2nd part of your question). When running a single-threaded workload, only one core is active.

    Modern multi-core systems exploit both, but you can have one without the other.

    For example, Sun's Niagara (UltraSPARC T1) was designed from the ground up to exploit thread-level parallelism (and memory-level parallelism) but not to try to run any single thread fast, which suits some kinds of server workloads. It has 8 physical single-issue in-order cores, with 4-way SMT (4 logical cores per physical core) instead of OoO exec to hide latency (e.g. cache misses). Single-threaded performance is crap, but max throughput running 32 threads was good for 2005 given its power budget and transistor count.

    Earlier x86 CPUs, like Pentium III, were superscalar but single-core; only multi-socket systems were SMP. But such CPUs could and did achieve CPI < 1.

    Your i5 4th-gen CPU is Haswell. See David Kanter's deep dive on Haswell's microarchitecture, including block diagrams of how wide various stages are inside each core.

    Do processors run single-threaded programs on multiple cores?

    NO, a single core is superscalar on its own, e.g. having 4 integer ALU execution units in Haswell or Zen. (And on Intel CPUs, 3 SIMD ALU execution units which are on the same execution ports as the scalar / general-purpose integer ALUs.) And a front-end wide enough to match.

    In general, superscalar CPUs are capable of running at least 2 instructions per clock per core.

    This wrong guess in your question is a duplicate of How does a single thread run on multiple cores? on programmers.SE. Answer: they don't; each core has a wide front-end and multiple execution units in the back-end.

    Until we hit diminishing returns from making a single core wider, CPU designers kept doing that instead of building multi-core CPUs; time slicing / pre-emptive multitasking was generally good enough. One faster core is better than N cores of 1/N speed for almost everything. But these days that's not the tradeoff; it's more like N cores of 1/sqrt(N) speed.
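
    Thread-level parallelism, by contrast, only happens when the program explicitly creates threads (or processes) for the OS to schedule onto other cores. A hedged sketch with POSIX threads (the names are illustrative, not from the question's code):

        /* tlp_sketch.c (illustrative): only explicit threads give the kernel
         * something it can run on a second core at the same time.
         *
         *   gcc -m32 -O2 -pthread tlp_sketch.c -o tlp_sketch
         */
        #include <pthread.h>
        #include <stdio.h>

        #define PER_THREAD 200000000u

        static void *count_up(void *arg) {
            unsigned n = 0;
            for (unsigned i = 0; i < PER_THREAD; i++) {
                n += 1;
                __asm__ volatile("" : "+r"(n));   /* keep the loop from being folded away */
            }
            *(unsigned *)arg = n;
            return NULL;
        }

        int main(void) {
            unsigned r1 = 0, r2 = 0;
            pthread_t t;

            /* One loop in a new thread, one in the main thread: now the two
             * loops can run on two different cores simultaneously. A purely
             * single-threaded program never gets this kind of parallelism. */
            pthread_create(&t, NULL, count_up, &r1);
            count_up(&r2);
            pthread_join(t, NULL);

            printf("%u %u\n", r1, r2);
            return 0;
        }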


    • addl %eax, (%eax)
    • addl %ebx, (%ebx)

    These memory-destination add instructions take more than 1 cycle each to complete, and decode to at least 2 uops each on modern Intel: a micro-fused load+add, and a store (micro-fused store-address + store-data). The load+add parts of both can start in the same cycle if they run on the same physical core.

    Ice Lake can also execute both stores in the same cycle, but earlier modern x86 CPUs only do 1 store per clock. (e.g. Intel from Haswell through Coffee Lake can do 2 loads + 1 store per clock cycle. SnB/IvB can do address-generation for 2 memory operations per cycle, and can sustain that throughput as long as at most one of them is a store, with a special 2+1 case for 256-bit vectors that reuse the same address-generation for 2 cycles of data.)

    Unless EAX and EBX hold the same pointer value, these instructions access different memory and different registers, and are fully independent except for resource conflicts over execution units (load, add, store). (Register renaming handles the write-after-write hazard for the FLAGS output.)
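
    If you want to experiment with exactly that pair of instructions, GNU C inline asm is one way to get gcc -m32 to emit register-source, memory-destination adds to two independent locations (a sketch; add_to_mem and the constants are made up for illustration):

        /* memdst_add.c (illustrative): two memory-destination adds to
         * independent locations, run back to back on one core.
         *
         *   gcc -m32 -O2 memdst_add.c -o memdst_add
         *
         * As described above, each add decodes to a load+add part and a store
         * part on modern Intel; x and y live at different addresses and use
         * different source registers, so the two updates are independent.
         */
        #include <stdio.h>

        /* Emits "addl %reg, (mem)": register source, memory destination. */
        static inline void add_to_mem(unsigned *p, unsigned v) {
            __asm__ volatile("addl %1, %0" : "+m"(*p) : "r"(v));
        }

        int main(void) {
            unsigned x = 0, y = 0;
            for (unsigned i = 0; i < 100000000u; i++) {
                add_to_mem(&x, 1);   /* independent of the update to y below */
                add_to_mem(&y, 2);   /* different address, different source value */
            }
            printf("%u %u\n", x, y);
            return 0;
        }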