On x86, what type of physical (internal) registers does a CPU use for XMM registers? Would that be integer or vector physical registers?
I think vector registers are used, because XMM registers are 128 bits wide. Any confirmation is appreciated.
XMM registers are vector registers. They're renamed onto the FP/SIMD register file, not the general-purpose integer one, regardless of whether you're using SIMD-integer or SIMD-FP instructions.
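As a small illustration of that point, here is a minimal sketch using SSE2 intrinsics. The function names `add_i32x4` and `add_f32x4` are made up for this example, and the instruction each intrinsic compiles to depends on the compiler and flags, so check the generated assembly if it matters. Both the SIMD-integer and the SIMD-FP version take their operands in XMM registers.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (baseline on x86-64) */
#include <stdint.h>

/* SIMD-integer add of four 32-bit lanes.
 * Typically compiles to paddd on XMM registers. */
void add_i32x4(const int32_t a[4], const int32_t b[4], int32_t out[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi32(va, vb));
}

/* SIMD-FP add of four single-precision lanes.
 * Typically compiles to addps on XMM registers. */
void add_f32x4(const float a[4], const float b[4], float out[4])
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}
```

Compiling with something like `gcc -O2 -S` and reading the output should show `xmm` operands in both functions, never general-purpose registers like `eax` for the vector data.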
https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/ shows how to approximately measure the capacities of the physical register files for integer vs. SIMD, since those can be a smaller limit than ReOrder Buffer size for hiding cache-miss latency.
Intel since Sandybridge, and AMD for even longer, have renamed registers onto physical register files (PRFs), with separate ones for general-purpose integer vs. SIMD/FP.
https://www.realworldtech.com/sandy-bridge/5/ shows that Sandybridge's SIMD PRF has 144 entries, vs. 160 entries in the general-purpose integer PRF. (P6 family, Nehalem and earlier, didn't use a separate PRF, but kept register values directly in the ROB.) Skylake has 180 entries in the integer PRF vs. 168 in the SIMD PRF: https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Scheduler
Skylake splits further, with a separate register file for renaming 80-bit x87/MMX and AVX-512 mask registers (k0..7), separate from the 512-bit entries in the vector register file. https://travisdowns.github.io/blog/2020/05/26/kreg2.html
Also related: scalar FP math such as addsd xmm0, xmm1 still uses SIMD registers and execution units.
For more about x86 CPU internals, see Agner Fog's microarch guide on https://agner.org/optimize/ and other links in https://stackoverflow.com/tags/x86/info
Also, for good measure, "Modern Microprocessors: A 90-Minute Guide!" is a good read that covers a lot of general design considerations in modern CPUs.
For example, ADDPD XMM1, XMM2. To restate the question: will this instruction be scheduled on vector units or on regular integer units?
The uop for that instruction will run on a SIMD-FP execution unit, after the CPU reads its inputs from the appropriate register file or forwards one or both from a previous instruction.
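To make the instruction under discussion concrete, here is a hedged sketch of the packed and scalar double-precision adds written with SSE2 intrinsics. The names `addpd_demo` and `addsd_demo` are hypothetical, and a compiler is free to pick different instructions. Both forms take whole XMM registers as operands. The scalar form adds only the low lane and passes the upper lane of its first operand through.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Packed add of two doubles per register.
 * Typically compiles to addpd xmm, xmm. */
void addpd_demo(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}

/* Scalar add of the low double only.
 * Typically compiles to addsd xmm, xmm.
 * The upper lane of the result is copied from the first operand (a). */
void addsd_demo(const double a[2], const double b[2], double out[2])
{
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_sd(va, vb));
}
```

With a = {1.0, 2.0} and b = {10.0, 20.0}, the packed version produces {11.0, 22.0} while the scalar version produces {11.0, 2.0}.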
On Intel CPUs, execution ports have both SIMD and integer execution units, so this uop can compete with integer instructions like add eax, ecx for execution-port throughput. See https://www.realworldtech.com/haswell-cpu/4/ for Haswell vs. Sandybridge execution unit distribution. (Alder Lake added yet another execution port with only integer units. See https://uops.info/ and Agner Fog's guides.)
On AMD CPUs, there is a separate group of SIMD/FP execution ports, independent of the integer execution ports. See a Zen 2 diagram for example: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram So if a bunch of instructions are waiting for inputs that finally become ready, a Zen core can begin executing 4 integer and 4 FP/SIMD uops in the same cycle, plus some loads and stores. (The front-end is "only" 5 instructions or 6 uops wide, so it can't sustain that.)
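The kind of instruction mix being described can be sketched as a loop that carries one integer add chain and one independent SIMD-FP add chain. This is only an illustration of the mix, not a benchmark: `mixed_int_and_simd` is a made-up name, and a compiler may vectorize, unroll, or simplify either chain, so inspect the generated assembly before drawing conclusions. In the straightforward compilation, the integer add and the addpd have no data dependency on each other, so they can issue in the same cycle: on shared ports for Intel, or on separate integer and FP ports for AMD Zen.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Runs n iterations with two independent dependency chains:
 *   - isum: a general-purpose integer add per iteration
 *   - vsum: a packed-double SIMD add per iteration (typically addpd)
 * Returns the integer sum and stores the two FP lanes to fp_out. */
uint64_t mixed_int_and_simd(size_t n, double fp_out[2])
{
    uint64_t isum = 0;
    __m128d vsum = _mm_setzero_pd();
    /* _mm_set_pd takes (high, low): low lane = 1.0, high lane = 2.0 */
    const __m128d vstep = _mm_set_pd(2.0, 1.0);

    for (size_t i = 0; i < n; i++) {
        isum += i;                       /* integer execution unit */
        vsum = _mm_add_pd(vsum, vstep);  /* SIMD-FP execution unit */
    }

    _mm_storeu_pd(fp_out, vsum);
    return isum;
}
```

Timing a loop like this with performance counters (and comparing against versions with only one of the two chains) is one way to see whether the integer and SIMD work compete for ports on a given CPU.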