Why use Push/Pop instead of Mov to put a number in a register in shellcode?

I have some sample code from a shellcode payload that shows a for loop using push/pop to set the counter:

push 9
pop ecx

Why can it not just use mov?

mov ecx, 9

Solution

Yes, normally you should use mov ecx, 9 for performance reasons. It is a single-uop instruction that can run on any ALU execution port, so it runs more efficiently than push/pop. (This is true across all existing CPUs that Agner Fog has tested: https://agner.org/optimize/)

The normal reason for push imm8 / pop r32 is that the machine code is free of zero bytes. This is important for shellcode that has to overflow a buffer via strcpy or any other method that treats it as part of an implicit-length C string terminated by a 0 byte.
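To make the zero-byte problem concrete, here is a minimal sketch that mimics what a strcpy-style copy does to each encoding. The helper name c_strcpy is hypothetical (it only imitates the stop-at-first-zero behaviour, it is not the libc function), and the two byte strings are the 32-bit encodings of mov ecx, 9 and push 9 / pop ecx.

```python
def c_strcpy(src: bytes) -> bytes:
    """Imitate strcpy: copy up to, but not including, the first 0 byte.

    Hypothetical helper for illustration only, not the real libc routine.
    """
    end = src.find(b"\x00")
    return src if end == -1 else src[:end]


MOV_ECX_9 = bytes.fromhex("b909000000")   # mov ecx, 9
PUSH_POP_ECX_9 = bytes.fromhex("6a0959")  # push 9 ; pop ecx

for name, code in [("mov", MOV_ECX_9), ("push/pop", PUSH_POP_ECX_9)]:
    copied = c_strcpy(code)
    print(f"{name:9} {code.hex(' ')} -> copied {len(copied)} of {len(code)} bytes")
```

The mov version is cut off after its second byte, so everything after it in the payload would be lost too, while the push/pop version is copied whole.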

mov ecx, immediate is only available with a 32-bit immediate, so the machine code is B9 09 00 00 00, which contains three zero bytes. Compare that with 6A 09 (push 9) followed by 59 (pop ecx): three bytes, none of them zero.

(ECX is register number 1, which is where B9 and 59 come from: the low 3 bits of the opcode byte = 001)
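As a rough illustration of how the register number ends up in the opcode byte, the sketch below builds both encodings by hand. The function names and the REG32 table are made up for this example. It only covers 32-bit mode and the eight classic registers.

```python
import struct

# Register numbers as used in the low 3 bits of the short-form opcodes.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def mov_r32_imm32(reg: str, imm: int) -> bytes:
    """mov r32, imm32: opcode B8+r, then a 4-byte little-endian immediate."""
    return bytes([0xB8 | REG32[reg]]) + struct.pack("<I", imm & 0xFFFFFFFF)


def push_imm8_pop_r32(reg: str, imm: int) -> bytes:
    """push imm8 (6A ib) followed by pop r32 (58+r).

    The immediate is sign-extended, so only -128..127 can be encoded.
    """
    if not -128 <= imm <= 127:
        raise ValueError("push imm8 only encodes -128..127")
    return bytes([0x6A, imm & 0xFF, 0x58 | REG32[reg]])


print(mov_r32_imm32("ecx", 9).hex(" "))
print(push_imm8_pop_r32("ecx", 9).hex(" "))
```

Note that the push/pop form is only zero-free when the value itself is nonzero: push 0 encodes as 6A 00.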

The other use-case is purely code-size: mov r32, imm32 is 5 bytes (using the no-ModRM encoding that puts the register number in the low 3 bits of the opcode), versus 3 bytes for push imm8 / pop r32. That is because x86 unfortunately lacks a sign-extended imm8 opcode for mov (there's no mov r/m32, imm8). Such a form exists for nearly all ALU instructions that date back to 8086.

In 16-bit 8086, that encoding wouldn't have saved any space: the 3-byte short-form mov r16, imm16 would be just as good as a hypothetical mov r/m16, imm8 for almost everything, except moving an immediate to memory where the mov r/m16, imm16 form (with a ModRM byte) is needed.

Since 386's 32-bit mode didn't add new opcodes specific to that mode, and just changed the default operand-size and immediate widths, this "missed optimization" in the ISA in 32-bit mode started with 386. With full-width immediates being 2 bytes longer, an add r32, imm32 is now longer than an add r/m32, imm8. See "x86 assembly 16 bit vs 8 bit immediate operand encoding". But we don't have that option for mov because there's no MOV opcode that sign-extends (or zero-extends) its immediate.
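The size argument can be checked by counting bytes. The sketch below builds add reg, imm in both its imm8 form (83 /0 ib) and its full-width form (81 /0 iw or id), plus the short-form mov, for a 16-bit and a 32-bit default operand size. The function names are invented for this example, and the 3-byte length given for mov r/m, imm8 is for an encoding that does not exist: it is simply opcode + ModRM + imm8.

```python
def add_reg_imm_encodings(reg_num: int, imm: int, operand_bytes: int):
    """Return (imm8 form, full-width form) of add reg, imm.

    operand_bytes is the mode's default operand size: 2 (16-bit) or 4 (32-bit).
    """
    modrm = 0xC0 | reg_num  # mod=11 (register direct), reg field /0 = ADD
    mask = (1 << (8 * operand_bytes)) - 1
    short = bytes([0x83, modrm, imm & 0xFF])
    full = bytes([0x81, modrm]) + (imm & mask).to_bytes(operand_bytes, "little")
    return short, full


def mov_reg_imm_short_form(reg_num: int, imm: int, operand_bytes: int) -> bytes:
    """mov reg, imm using the no-ModRM B8+r opcode and a full-width immediate."""
    mask = (1 << (8 * operand_bytes)) - 1
    return bytes([0xB8 | reg_num]) + (imm & mask).to_bytes(operand_bytes, "little")


HYPOTHETICAL_MOV_IMM8_LEN = 3  # opcode + ModRM + imm8; no such encoding exists

ECX = CX = 1
for bits, size in [(16, 2), (32, 4)]:
    short, full = add_reg_imm_encodings(ECX, 9, size)
    mov = mov_reg_imm_short_form(ECX, 9, size)
    print(f"{bits}-bit: add imm8 = {len(short)} bytes, add full = {len(full)} bytes, "
          f"mov short form = {len(mov)} bytes, "
          f"hypothetical mov imm8 = {HYPOTHETICAL_MOV_IMM8_LEN} bytes")
```

In 16-bit mode the short-form mov already matches the hypothetical imm8 form at 3 bytes, while in 32-bit mode it is 2 bytes longer.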

Fun fact: clang -Oz (optimize for size even at the expense of speed) will compile int foo(){return 9;} to push 9 ; pop rax. GCC 12 also supports a similar -Oz.

See also Tips for golfing in x86/x64 machine code on Codegolf.SE (a site about optimizing for size usually for fun, rather than to fit code into a small ROM or boot sector. But for machine code, optimizing for size does have practical applications sometimes, even at the expense of performance.)

If you already had another register with known contents, creating 9 in another register can be done with a 3-byte lea ecx, [eax-0 + 9] (if EAX holds 0; in general, subtract whatever value the register is known to hold). That is just opcode + ModRM + disp8. So you can avoid the push/pop hack if you were already going to xor-zero any other register. lea is barely less efficient than mov, and you could consider it when optimizing for speed because smaller code-size has minor speed benefits at large scale: more L1i cache hits, and sometimes faster decode if the uop cache isn't already hot.
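For completeness, a small sketch of the lea encoding described above: opcode 8D, a ModRM byte with mod=01 (base register plus disp8), and the displacement. The function name is made up for this example, and it deliberately rejects ESP as a base because that would need an extra SIB byte.

```python
EAX, ECX = 0, 1


def lea_r32_base_disp8(dst: int, base: int, disp: int) -> bytes:
    """lea dst, [base + disp8] in 32-bit mode: 8D, ModRM (mod=01), disp8."""
    if base == 4:
        raise ValueError("ESP as a base needs a SIB byte; not handled in this sketch")
    if not -128 <= disp <= 127:
        raise ValueError("disp8 only encodes -128..127")
    modrm = 0x40 | (dst << 3) | base  # mod=01, reg=dst, rm=base
    return bytes([0x8D, modrm, disp & 0xFF])


XOR_EAX_EAX = bytes.fromhex("31c0")  # xor eax, eax (zeroes EAX)
code = XOR_EAX_EAX + lea_r32_base_disp8(ECX, EAX, 9)

print(code.hex(" "))
print("contains a zero byte:", 0 in code)
```

If EAX held some other known value, the same 3-byte form works as long as 9 minus that value still fits in a signed byte.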