I am reading http://www.realworldtech.com/sandy-bridge/ and I'm having trouble understanding a couple of points:
The dedicated stack pointer tracker is also present in Sandy Bridge and renames the stack pointer, eliminating serial dependencies and removing a number of uops.
What is a dedicated stack pointer tracker, actually?
For Sandy Bridge (and the P4), Intel still uses the term ROB. But it is critical to understand that, in this context, it only refers the status array for in-flight uops
What does that mean in practice? Please clarify.
As Agner Fog's microarch doc explains, the stack engine handles the `rsp+=8` / `rsp-=8` part of push/pop / call/ret in the issue/rename stage of the pipeline (before issuing uops into the Out-of-Order (OoO) part of the core, the back-end).
So the back-end only has to handle the load/store part, with an address generated by the stack engine. The stack engine occasionally has to insert a uop to sync its offset from `rsp`: when the 8-bit displacement counter overflows, or when the OoO back-end needs the value of `rsp` directly. For example, `sub rsp, 8` or `mov [rsp-8], eax` after a `call`, `ret`, `push` or `pop` typically causes an extra uop to be inserted on Intel CPUs. AMD CPUs apparently don't need extra sync uops.
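To make the bookkeeping concrete, here is a minimal Python sketch of a stack engine as described above. It is a toy model, not how the hardware is implemented: the class name `StackEngine`, its methods, the signed 8-bit range, and the one-uop-per-push simplification are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class StackEngine:
    """Toy front-end model: tracks an offset not yet applied to the back-end's rsp."""
    delta: int = 0
    uops: list = field(default_factory=list)  # uops sent to the OoO back-end

    def _sync(self):
        # Insert one extra uop that folds the pending offset into the real rsp.
        if self.delta != 0:
            self.uops.append(f"sync: add rsp, {self.delta}")
            self.delta = 0

    def _adjust(self, amount):
        # Assumption: the offset is a signed 8-bit counter; sync before it overflows.
        if not -128 <= self.delta + amount <= 127:
            self._sync()
        self.delta += amount

    def push(self, reg):
        self._adjust(-8)
        # Simplification: the store is counted as a single uop.
        self.uops.append(f"store [rsp{self.delta:+d}] <- {reg}")

    def pop(self, reg):
        self.uops.append(f"load {reg} <- [rsp{self.delta:+d}]")
        self._adjust(+8)

    def explicit_rsp_use(self, text):
        # The back-end needs the architectural rsp value, so sync first.
        self._sync()
        self.uops.append(text)


engine = StackEngine()
engine.push("rbx")
engine.push("rbp")
engine.explicit_rsp_use("sub rsp, 8")
for uop in engine.uops:
    print(uop)
```

In this model the two pushes produce only store uops, and the single sync uop appears just before `sub rsp, 8`, which is the only instruction that reads `rsp` in the back-end.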
Note that Agner's instruction tables show that Pentium-M and later decode `pop reg` to a single uop which runs only on the load port. But Pentium II/III decodes `pop eax` to 2 uops (1 ALU and 1 load), because there's no stack engine to handle the ESP adjustment outside of the back-end. Besides taking extra uops, a long chain of push/pop and call/ret creates a serial dependency on ESP, so out-of-order execution has to chew through the ALU uops before a value is available for a `mov ebp, esp`, or an address for `mov eax, [esp+16]`.
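The difference can be put into numbers with a small back-of-the-envelope model. The function below is a hypothetical sketch (the name `cost_before_esp_read` and the assumption of exactly one sync uop are mine, not from Agner's tables): it counts the uops for a run of back-to-back pops followed by an instruction that reads ESP directly, and the length of the serial ESP-update chain that instruction has to wait for.

```python
def cost_before_esp_read(n_pops: int, stack_engine: bool) -> dict:
    """Toy cost of n back-to-back pops followed by e.g. `mov ebp, esp`.

    Returns the uop count for the pops (plus any sync uop) and the number of
    dependent ESP-updating uops the ESP reader must wait for.
    """
    if stack_engine:
        # One load uop per pop; the ESP adjustment is tracked before the back-end.
        # Assumption: a single sync uop is inserted before the explicit ESP read.
        sync = 1 if n_pops > 0 else 0
        return {"uops": n_pops + sync, "esp_chain": sync}
    # Without a stack engine: one load + one ALU uop per pop, and each ALU uop
    # depends on the ESP value produced by the previous one.
    return {"uops": 2 * n_pops, "esp_chain": n_pops}


for engine_present in (False, True):
    print(engine_present, cost_before_esp_read(4, engine_present))
```

For four pops this model gives 8 uops and a 4-deep ESP chain without a stack engine, versus 5 uops and a 1-deep chain with one.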
SnB-family microarchitectures (and P4) have a physical register file, so the ROB stores register numbers (i.e. a level of indirection) instead of the data directly. Re-Order Buffer is still an excellent name for that part of the CPU.
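As a rough illustration of that indirection, the sketch below models a ROB whose entries hold only register numbers while the data lives in a separate physical register file. Everything here (`RenamedCore`, the field names, the 8-entry file, writing the result at issue time) is a simplifying assumption for illustration, not a description of the real hardware structures.

```python
class RenamedCore:
    """Toy rename model: data lives in the PRF, the ROB holds register numbers."""

    def __init__(self, num_phys: int = 8):
        self.prf = [0] * num_phys           # physical register file (the data)
        self.free = list(range(num_phys))   # free physical register numbers
        self.rat = {}                       # architectural name -> physical number
        self.rob = []                       # in-order status entries, no data

    def issue(self, dest: str, value: int) -> None:
        new_phys = self.free.pop(0)
        old_phys = self.rat.get(dest)       # previous mapping, freed at retirement
        self.rat[dest] = new_phys
        # Stand-in for execution: the result is written to the PRF, not the ROB.
        self.prf[new_phys] = value
        self.rob.append({"dest": dest, "new_phys": new_phys, "old_phys": old_phys})

    def retire(self) -> dict:
        entry = self.rob.pop(0)
        if entry["old_phys"] is not None:
            self.free.append(entry["old_phys"])
        return entry


core = RenamedCore()
core.issue("rax", 1)
core.issue("rax", 2)
print(core.rob)                    # register numbers only
print(core.prf[core.rat["rax"]])   # the current value is read through the mapping
```

Retiring the second write to `rax` returns the physical register that held the first value to the free list, which is the only bookkeeping the ROB entry needs to carry.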
Note that SnB introduced AVX, with 256b vectors. Making every ROB entry big enough to store double-size vectors was presumably undesirable compared to only keeping them in a smaller FP register file.
SnB simplified the uop format to save power. This did lead to a sacrifice in uop micro-fusion capability, though: the decoders and uop-cache can still micro-fuse memory operands using 2-register (indexed) addressing modes, but they're "unlaminated" before issuing into the OoO back-end.
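The effect on fused-domain uop counts can be sketched as follows. This is a deliberately simplified model of only the SnB behavior described above; the `Insn` type, its fields, and the `fused_domain_uops` function are hypothetical names, and other microarchitectures follow different rules.

```python
from dataclasses import dataclass


@dataclass
class Insn:
    text: str
    mem_operand: bool   # ALU op with a micro-fusable memory operand
    indexed: bool       # addressing mode uses two registers (base + index)


def fused_domain_uops(insn: Insn, stage: str) -> int:
    """Fused-domain uop count at 'decode' (decoders / uop cache) or 'issue'."""
    if stage not in ("decode", "issue"):
        raise ValueError("stage must be 'decode' or 'issue'")
    if insn.mem_operand and insn.indexed and stage == "issue":
        return 2  # un-laminated: load and ALU uops issue separately
    return 1      # plain register op, or load+op still micro-fused


examples = [
    Insn("add eax, ebx", mem_operand=False, indexed=False),
    Insn("add eax, [rdi]", mem_operand=True, indexed=False),
    Insn("add eax, [rdi+rsi*4]", mem_operand=True, indexed=True),
]
for insn in examples:
    print(insn.text,
          fused_domain_uops(insn, "decode"),
          fused_domain_uops(insn, "issue"))
```

In this model only the indexed form costs an extra slot at issue, even though all three instructions occupy a single entry in the decoders and uop cache.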