assembly x86-64 disassembly machine-code micro-architecture

How to look up what form of an instruction is used, by opcode or disassembly?

Sites like https://uops.info/ and Agner Fog's instruction tables, and even Intel's own manuals, list various forms of the same instruction. For example add m, r (in Agner's tables) or add (m64, r64) on uops.info, or ADD r/m64, r64 in Intel's manual (https://www.felixcloutier.com/x86/add).

Here's a simple example I ran on Godbolt:

__thread int a;
void Test() {
    a+=5;
}

The add is add DWORD PTR fs:0xfffffffffffffffc,0x5. Its machine code starts with the bytes 64 83 04 25.

There are a few ways to write my real code, but I wanted to look up how many cycles this might take, along with other information. How the heck do I find the reference entry for this instruction? I tried https://uops.info/table.html, typing in "add" and checking off my architecture, but I have no idea which of the entries is the instruction that's being used.

For now, in this specific case, I'm guessing the form is add m64, r64, but I have no idea whether there's any penalty for using fs: before the address, or whether there's a way to see opcodes so I can confirm I'm looking at the right reference.

Solution

http://ref.x86asm.net/coder64.html has an opcode map, but with a bit of experience you won't need one most of the time. Especially when you have disassembly, you can just check the manual entry for that mnemonic (https://www.felixcloutier.com/x86/add), and see which of the possible opcodes it is (83 /0 add r/m32, imm8).

Clearly this has a 32-bit operand-size (dword ptr) memory destination, and the source is an immediate (numeric constant). That rules out the add m64, r64 form for two separate reasons: the operand-size is 32-bit rather than 64-bit, and the source is an immediate rather than a register. So even without looking at the machine code, it's definitely add r/m32, imm with an imm8 or imm32. Any sane assembler will of course pick imm8 for a small constant that fits in a signed 8-bit integer.
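If you do want to confirm the form from the machine code, the bytes can be decoded by hand. The sketch below is a minimal illustration, not a real decoder. It assumes the full encoding is 64 83 04 25 fc ff ff ff 05 (the question only shows the first four bytes; the rest is inferred from the disassembly), it only handles an optional FS prefix followed by opcode 83, and it assumes there is no 66 or REX.W prefix, so the operand-size is 32-bit. The names split_modrm, group1_mnemonic and describe_83 are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Fields of a ModRM byte: mod (bits 7-6), reg (bits 5-3), rm (bits 2-0). */
typedef struct {
    unsigned mod, reg, rm;
} ModRM;

static ModRM split_modrm(uint8_t b)
{
    ModRM m = { (b >> 6) & 3u, (b >> 3) & 7u, b & 7u };
    return m;
}

/* For opcodes 80, 81 and 83 the ModRM.reg field (the "/digit" in the
 * manual) selects the operation: /0 is add, /2 is adc, and so on. */
static const char *group1_mnemonic(unsigned reg)
{
    static const char *const names[8] = {
        "add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"
    };
    return names[reg & 7u];
}

/* Hypothetical helper: describe an instruction of the shape
 *   [64] 83 modrm ...
 * Assumes 32-bit operand-size (no 66 prefix, no REX.W).
 * Writes e.g. "add m32, imm8" into out and sets *has_fs.
 * Returns 0 on success, -1 if the bytes don't match this shape. */
static int describe_83(const uint8_t *code, size_t len,
                       char *out, size_t outsz, int *has_fs)
{
    size_t i = 0;
    *has_fs = 0;
    if (i < len && code[i] == 0x64) {   /* 64 = FS segment-override prefix */
        *has_fs = 1;
        i++;
    }
    if (i + 1 >= len || code[i] != 0x83) /* need opcode 83 plus a ModRM byte */
        return -1;

    ModRM m = split_modrm(code[i + 1]);
    /* mod == 3 means a register operand; anything else is memory. */
    const char *dst = (m.mod == 3) ? "r32" : "m32";
    snprintf(out, outsz, "%s %s, imm8", group1_mnemonic(m.reg), dst);
    return 0;
}
```

For the bytes above, ModRM 04 splits into mod=0, reg=0, rm=4. reg=0 selects add (the /0 in 83 /0), and mod != 3 means a memory destination, so the helper reports add m32, imm8 with the FS flag set. (rm=4 means a SIB byte follows; SIB 25 encodes a bare disp32 with no base or index register.)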

Generally different ways of encoding the same instruction aren't special, so the source-level assembly / disassembly is fine, as long as you understand what's a register, what's memory, and what's an immediate.

But there are a few special cases. For example, Agner Fog's guide notes that rotates by 1 using the short-form encoding are slower than rol reg, imm8 even when the imm8 is 1, because the flag-updating special case for rotate-by-1 actually depends on the opcode, not the immediate count. (Intel's documentation apparently assumes your assembler will always pick the short form for a rotate by constant 1; the part about "masked count" may only apply to rotates by cl. See https://www.felixcloutier.com/x86/rcl:rcr:rol:ror#flags-affected. I haven't tested this recently and am not 100% sure I'm remembering correctly when OF is updated, although the other flags in the SPAZO group are always left unmodified. IIRC that's why rotates by 1 (2 uops) and by cl (3 uops) are slow on Intel, vs. rotates by other immediate counts (1 uop).)
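To tell which rotate encoding you actually have, it's enough to look at the opcode byte and the ModRM.reg field. The following is a rough sketch under simplifying assumptions: it ignores all prefixes (including REX and 66), and the enum and function names are invented for illustration. It only identifies the encoding; it says nothing about uop counts on any particular CPU.

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ROT_NOT_ROTATE = 0,
    ROT_BY_ONE_SHORT,   /* D0 / D1: implicit count of 1, no immediate byte */
    ROT_BY_CL,          /* D2 / D3: count in cl */
    ROT_BY_IMM8         /* C0 / C1: count in an imm8, even if it equals 1 */
} RotateForm;

/* Hypothetical helper: classify a rotate (rol/ror/rcl/rcr) by encoding.
 * code[0] is the opcode and code[1] the ModRM byte; prefixes are assumed
 * to have been skipped by the caller. */
static RotateForm classify_rotate(const uint8_t *code, size_t len)
{
    if (len < 2)
        return ROT_NOT_ROTATE;

    unsigned reg = (code[1] >> 3) & 7u;
    if (reg > 3)                 /* /4../7 are shifts, not rotates */
        return ROT_NOT_ROTATE;

    switch (code[0]) {
    case 0xD0: case 0xD1:
        return ROT_BY_ONE_SHORT;
    case 0xD2: case 0xD3:
        return ROT_BY_CL;
    case 0xC0: case 0xC1:
        return len >= 3 ? ROT_BY_IMM8 : ROT_NOT_ROTATE;
    default:
        return ROT_NOT_ROTATE;
    }
}
```

With this, D1 C0 (rol eax, 1 in short form) and C1 C0 01 (rol eax, 1 with an explicit imm8) classify differently even though they disassemble to the same text.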

Or see https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks. Specifically, I mean the case discussed in "Which Intel microarchitecture introduced the ADC reg,0 single-uop special case?": even on Haswell / Skylake, adc al,0 (using the short form with no modrm byte) is 2 uops, and so is the equivalent adc eax, 12345, but adc edx, 12345 is 1 uop using the non-special case. In cases like these you have to either check the machine code, or know how your assembler will have chosen to encode a given instruction (optimizing for size).
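Checking the machine code for the adc case comes down to the first opcode byte. As a hedged sketch (prefixes are ignored, and the type and function names are hypothetical), the accumulator short forms are 14 ib and 15 id, while the general forms with a ModRM byte are 80 /2 ib, 81 /2 id and 83 /2 ib:

```c
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ADC_ENC_UNKNOWN = 0,
    ADC_ENC_ACC_SHORT,   /* 14 ib / 15 id: adc al/eax, imm with no ModRM byte */
    ADC_ENC_MODRM_IMM    /* 80 /2, 81 /2, 83 /2: adc r/m, imm with a ModRM byte */
} AdcEncoding;

/* Hypothetical helper: classify an adc-with-immediate by its encoding.
 * code[0] must be the opcode byte; any prefixes are assumed to have been
 * skipped by the caller. */
static AdcEncoding classify_adc_imm(const uint8_t *code, size_t len)
{
    if (len < 2)
        return ADC_ENC_UNKNOWN;

    switch (code[0]) {
    case 0x14: case 0x15:
        return ADC_ENC_ACC_SHORT;
    case 0x80: case 0x81: case 0x83:
        /* ModRM.reg == 2 selects adc within this opcode group. */
        return ((code[1] >> 3) & 7u) == 2 ? ADC_ENC_MODRM_IMM
                                          : ADC_ENC_UNKNOWN;
    default:
        return ADC_ENC_UNKNOWN;
    }
}
```

For example, 15 39 30 00 00 (adc eax, 12345) is the short form, while 81 D2 39 30 00 00 (adc edx, 12345) goes through the ModRM form.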

BTW, using a segment with a non-zero base adds 1 cycle of latency to address-generation, IIRC, but isn't a significant throughput penalty. (Unless of course throughput bottlenecks on a latency chain that it's part of...)