Intel's optimization manual (Chapter 3, Section 3.5.1) says:

> If you need to use multiple micro-op, non-microsequenced instructions, try to separate by a few single micro-op instructions. The following instructions are examples of multiple micro-op instructions not requiring the micro-sequencer:
>
> - ADC/SBB
> - CMOVcc
> - Read-modify-write instructions
You quote the GeeksforGeeks article "Implementation of Micro Instructions Sequencer", which says a "Micro Instructions Sequencer is a combination of all hardware for selecting the next micro-instruction address."
That GeeksforGeeks article is about much simpler CPUs, like the 8086 or 6502, where all instructions are handled by a microcode sequencer. Modern Intel and AMD CPUs decode simple-enough instructions directly into 1 or 2 uops (AMD) or 1 to 4 uops (Intel).
See also What is a microcoded instruction? about modern CPUs; it also has a section at the bottom about old-style "microcoded CPUs" like the 6502, and see a Q&A on retrocomputing for more about them.
See another section of Intel's manual where they document the pipeline of SnB-family CPUs, and explain that any instruction that runs as more than 4 uops decodes to a special kind of uop that triggers the microcode sequencer.
See https://www.realworldtech.com/sandy-bridge/4/ and Agner Fog's microarch guide (https://agner.org/optimize/), especially the Pentium Pro and Sandy Bridge sections. Quoting from the SnB section:
> There are four decoders, which can handle instructions generating one or more µops according to certain patterns. The following instruction patterns were successfully decoded in a single clock cycle in my experiments:
>
> - 1-1-1-1
> - 2-1-1
> - 3
> - 4
>
> Instructions that generate 3 or 4 µops are decoded alone. Instructions that generate more than four µops are handled by microcode which is less efficient.
This is what Intel's optimization guide is talking about, and it's what many other people call "microcoded instructions": instructions that indirect to the MS-ROM.
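As a concrete illustration, here's a toy Python model of those decode patterns (my simplification based on Agner Fog's description, not a cycle-accurate simulator; it ignores the uop cache and the microcode sequencer's real multi-cycle behavior). It also shows why Intel's advice to separate multi-uop instructions with single-uop ones can help:

```python
def decode_cycles(uop_counts):
    """Toy model of SnB legacy-decode grouping: up to 4 instructions and
    4 total uops per cycle; only the first decoder takes a multi-uop
    instruction, and 3- or 4-uop instructions decode alone.  Anything
    over 4 uops goes to the microcode sequencer (charged 1 cycle here,
    although it really takes more)."""
    cycles = 0
    i = 0
    while i < len(uop_counts):
        first = uop_counts[i]
        i += 1
        if first < 3:  # 1- or 2-uop leader: fill remaining slots with 1-uop insns
            total, insns = first, 1
            while (i < len(uop_counts) and insns < 4 and total < 4
                   and uop_counts[i] == 1):
                total += 1
                insns += 1
                i += 1
        # 3-uop, 4-uop, and (>4-uop) microcoded instructions decode alone
        cycles += 1
    return cycles

# Clustering the 2-uop instructions wastes decode slots...
print(decode_cycles([2, 2, 2, 1, 1, 1]))  # 4
# ...while interleaving them lets each cycle decode a 2-1-1 group:
print(decode_cycles([2, 1, 2, 1, 2, 1]))  # 3
```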
Perf events like `idq.ms_switches` are relevant: [Number of switches from DSB (Decode Stream Buffer) or MITE (legacy decode pipeline) to the Microcode Sequencer]. The DSB is the uop cache.
If looking at other perf events from `perf list`, read carefully: events like the following are *not* counting uops generated by the microcode sequencer. They're counting uops added to the IDQ while the issue/rename stage is reading from the microcode sequencer instead of the IDQ.

- `idq.ms_cycles`: [Cycles when uops are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy]
- `idq.ms_dsb_cycles`: [Cycles when uops initiated by Decode Stream Buffer (DSB) are being delivered to Instruction Decode Queue (IDQ) while Microcode Sequenser (MS) is busy]
At least that's my understanding, which is compatible with the fact that microcoded instructions like `rep movsb` need feedback from execution to know when to stop generating uops, so they have to get expanded at the issue/rename end of the IDQ (Instruction Decode Queue), not at the end where newer uops are added from legacy decode or the uop cache. (In David Kanter's RWT article about SnB, the IDQ is the 28-uop Decoder Queue in the block diagram. It shows a "ucode engine" writing to the IDQ, but I'm not sure that's accurate.)
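Here's a toy sketch of why that expansion has to happen at the read end: the number of uops the MS emits can depend on architectural state that's only known around issue time. This is invented for illustration, not Intel's actual mechanism, and real `rep movsb` doesn't literally emit one load/store pair per byte:

```python
from collections import deque

def drain_idq(idq_entries, ms_rom, regs):
    """Toy IDQ: normal uops issue as-is; an ("MS", name) entry point makes
    issue/rename read from the microcode sequencer instead, whose output
    can depend on register state (feedback from execution)."""
    issued = []
    idq = deque(idq_entries)
    while idq:
        entry = idq.popleft()
        if isinstance(entry, tuple) and entry[0] == "MS":
            issued.extend(ms_rom[entry[1]](regs))  # expand at the read end
        else:
            issued.append(entry)
    return issued

# Hypothetical MS routine: a rep-movsb-like instruction emitting one
# load/store pair per iteration, with the count taken from RCX at
# expansion time (real hardware is much smarter than this).
ms_rom = {"rep_movsb": lambda regs: ["load", "store"] * regs["rcx"]}

print(drain_idq(["add", ("MS", "rep_movsb"), "sub"], ms_rom, {"rcx": 2}))
# ['add', 'load', 'store', 'load', 'store', 'sub']
```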
I wrote about that detail of when the indirection to microcode is expanded in another answer: How are microcodes executed during an instruction cycle?
Other related Q&As:

- Branch alignment for loops involving micro-coded instructions on Intel SnB-family CPUs - a performance effect in the front-end, apparently due to the uop cache.
- Conditional jump instructions in MSROM procedures? - some info from Andy Glew about how MS-ROM code works and its limitations for implementing stuff like `rep movsb`.