Intel x86 Assembly: How to tell how many bits wide an argument is?


In the following assembly:

mov     dx, word ptr [ebp+arg_0]
mov     [ebp+var_8], dx

Thinking of this as an assembled C function, how many bits wide is (the argument to the C function) arg_0? How many bits wide is (the local C variable) var_8? That is to say, is it a short, an int, etc.

From this, it appears that var_8 is 16 bits, since dx is a 16-bit register. But I'm not sure about arg_0.

If the assembly also contains this line:

mov     ecx, [ebp+arg_0]

Would that imply that arg_0 is a 32-bit value?


Solution

  • There are three principles to understand in order to tackle this question.

    1. The assembler must be able to infer the correct length.
      Although Intel syntax does not use size suffixes the way AT&T syntax does, the assembler still needs a way to determine the size of the operands.
       
      The ambiguous instruction mov [var], 1 is written as movl $1, var in AT&T syntax if the store is 32-bit (note the l suffix), so the size of the immediate operand is easy to tell.
      An assembler that accepts Intel syntax needs a way to infer this size. There are four widely used options:

      • It is inferred from the other operand.
        This is the case when a register is involved, for example.
        E.g. mov [var], dx is a 16-bit store.
      • It is stated explicitly.
        mov WORD [var], dx
        MASM-syntax assemblers need a PTR after the size, because their size specifiers are only allowed on memory operands, not immediates or anywhere else.
        This is the form I prefer because it is clear, it stands out and it is a bit less error-prone (mov WORD [var], edx is invalid).
      • It is inferred from the context.

         var db 0
        
         mov [var], 1   ; MASM/TASM only: these assemblers associate sizes with labels
        

        MASM-syntax assemblers can infer that, since var is declared with db, its size is 8-bit and so is the store (by default).
        This is the form I don't like, because it makes the code harder to read (one good thing about assembly is the "locality" of the semantics of the instructions) and mixes high-level concepts like types with low-level concepts like store sizes. That's why NASM's syntax doesn't support magical / non-local size association.

      • In the vast majority of cases there is only one correct size.
        This is the case with push, branches and all the instructions whose operand size depends on the memory model or code size.
        The actual size used can be overridden for some instructions, but the default is a sensible choice (e.g. push word 123 vs. push 123).

       
      In short, there must be a way for the assembler to tell the size; otherwise it will reject the code. (Some low-quality assemblers, like emu8086, instead pick a default operand size in ambiguous cases.)

      If you are looking at disassembled code, disassemblers usually play it safe and always state the size explicitly.
      If yours does not, you must resort to inspecting the opcodes manually; if the disassembler won't show the opcodes, it is time to switch to another one.
      The disassembler has no trouble finding the size of an operand: the binary code it is disassembling is the same code the CPU executes, and the instruction encodings carry the operand size.
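
As a rough illustration of that last point, the operand size can be read straight off the instruction bytes. The sketch below covers only the register/memory MOV opcodes 0x88 to 0x8B in 32-bit code. The helper name mov_operand_size_bits and the displacements +8 and -8 standing in for arg_0 and var_8 are assumptions made for the example; they are not taken from the question's binary.

```python
def mov_operand_size_bits(code: bytes) -> int:
    """Return the operand size, in bits, of a MOV encoded for 32-bit mode.

    Sketch only: handles opcodes 0x88-0x8B (MOV between a register and r/m)
    and no prefix other than the 0x66 operand-size override.
    """
    i = 0
    has_override = False
    # 0x66 is the operand-size override prefix: in 32-bit code it selects
    # 16-bit operands instead of the default 32-bit ones.
    while i < len(code) and code[i] == 0x66:
        has_override = True
        i += 1
    if i >= len(code) or code[i] not in (0x88, 0x89, 0x8A, 0x8B):
        raise ValueError("unsupported encoding for this sketch")
    # Bit 0 of these opcodes distinguishes byte forms (0) from word/dword forms (1).
    if code[i] & 1 == 0:
        return 8
    return 16 if has_override else 32


# Assumed encodings, with arg_0 = +8 and var_8 = -8 (hypothetical offsets):
print(mov_operand_size_bits(bytes.fromhex("668b5508")))  # mov dx, word ptr [ebp+8]
print(mov_operand_size_bits(bytes.fromhex("668955f8")))  # mov [ebp-8], dx
print(mov_operand_size_bits(bytes.fromhex("8b4d08")))    # mov ecx, [ebp+8]
```

The only difference between the 16-bit and the 32-bit load is the 0x66 prefix, which is why a disassembler never has to guess.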
       

    2. The C language is intentionally loose on how C types map to the number of bits
       
      Trying to infer the type of a variable from the disassembly is not futile, but one must take the platform into account too, not only the architecture.
      The main data models in use are listed below (sizes in bits):

      Datatype    LP64    ILP64   LLP64   ILP32   LP32
      char        8       8       8       8       8
      short       16      16      16      16      16
      _int32      32          
      int         32      64      32      32      16
      long        64      64      32      32      32
      long long                   64      [64]                    
      pointer     64      64      64      32      32
      

      Windows on x86-64 uses LLP64. Other OSes on x86-64 typically use the x86-64 System V ABI, an LP64 model.
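
To make the platform dependence concrete, here is a minimal sketch that turns the table above into a lookup: given an observed width and a data model, it lists the C types that could match. The names DATA_MODELS and candidate_types are hypothetical, and only the cells that are filled in for char, short, int, long, long long and pointer are included.

```python
# Widths in bits per data model, copied from the filled-in cells of the table.
DATA_MODELS = {
    "LP64":  {"char": 8, "short": 16, "int": 32, "long": 64, "pointer": 64},
    "ILP64": {"char": 8, "short": 16, "int": 64, "long": 64, "pointer": 64},
    "LLP64": {"char": 8, "short": 16, "int": 32, "long": 32,
              "long long": 64, "pointer": 64},
    "ILP32": {"char": 8, "short": 16, "int": 32, "long": 32,
              "long long": 64, "pointer": 32},
    "LP32":  {"char": 8, "short": 16, "int": 16, "long": 32, "pointer": 32},
}


def candidate_types(bits: int, model: str) -> list:
    """List the C types that are `bits` wide under the given data model."""
    return [name for name, width in DATA_MODELS[model].items() if width == bits]


print(candidate_types(16, "ILP32"))  # a 16-bit access in typical 32-bit code
print(candidate_types(32, "ILP32"))  # a 32-bit access is far more ambiguous
print(candidate_types(16, "LP32"))   # the same 16 bits on a 16-bit-int platform
```

The same 16-bit access points to a short under ILP32 but could just as well be an int under LP32, which is why the width alone does not settle the type.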

    3. Assembly doesn't have types and programmers can exploit that
       
      Even compilers can exploit that.
       
      In the linked case, a variable bar of type long long (64-bit) is ORed with 1, and clang saves a REX prefix by ORing only the low byte. This causes a store-forwarding stall if the variable is reloaded right away with two dword loads or one qword load, so it's probably not a good choice, especially in 32-bit mode, where or dword [bar], 1 is the same size and the variable is likely to be reloaded as two 32-bit halves.
      Someone reading the disassembly incautiously could infer that bar is 8-bit.
      Tricks of this kind, where a variable or an object is accessed only partially, are common.
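
The effect of such a partial access can be simulated on a little-endian byte buffer. This is only a model of the memory behaviour, not of what any particular compiler emits, and the function name or_low_byte is made up for the example.

```python
import struct


def or_low_byte(value: int, mask: int) -> int:
    """OR `mask` into the lowest byte of a 64-bit little-endian variable."""
    memory = bytearray(struct.pack("<Q", value))  # the 8 bytes of the variable
    memory[0] |= mask & 0xFF                      # byte-sized access to the low byte
    return struct.unpack("<Q", memory)[0]         # reload all 64 bits


bar = 0x1122334455667700
# The byte-wide OR gives the same result as a full 64-bit OR with 1.
print(hex(or_low_byte(bar, 1)), hex(bar | 1))
```

Only one byte is touched, yet the variable still occupies eight bytes, so the width of a single access is a lower bound on the size of the object and nothing more.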
       
      Correctly guessing the size of a variable takes a bit of expertise.
      For example, structure members are usually padded, so the unused space between them may fool an inexperienced reader into thinking that a member is bigger than it is.
      The stack also has precise alignment requirements, which may make parameters look wider than they are.
       
      The rule of thumb is that compilers generally prefer to keep the stack 16-byte aligned and to naturally align all variables. Multiple narrow variables are packed into a single dword. When function args are passed on the stack, each one is padded to 32 or 64 bits, but that doesn't apply to the layout of locals on the stack.
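
The padding of stack arguments can be sketched as follows. The sketch assumes a 32-bit cdecl-style frame with a frame pointer, where the saved ebp sits at [ebp+0], the return address at [ebp+4] and the first argument at [ebp+8]. The function name stack_arg_offsets and its parameters are hypothetical.

```python
def stack_arg_offsets(arg_sizes, slot=4, first_offset=8):
    """Return the offset from ebp of each stack argument.

    arg_sizes holds the declared size of each argument in bytes; every
    argument is rounded up to a whole number of `slot`-byte stack slots.
    """
    offsets = []
    offset = first_offset
    for size in arg_sizes:
        offsets.append(offset)
        offset += -(-size // slot) * slot  # ceiling division, then back to bytes
    return offsets


# f(short a, int b, char c): each argument still gets its own 4-byte slot.
print(stack_arg_offsets([2, 4, 1]))
# g(long long a, short b): the 8-byte argument takes two slots.
print(stack_arg_offsets([8, 2]))
```

Under these assumptions the spacing between arguments says nothing about a short versus an int; only the width of the accesses does.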

    To finally answer your question

    Yes, from the first snippet of code you can assume that the value of arg_0 is 16 bits wide.
    Note that since it's a function arg passed on the stack, its stack slot is actually 32 bits wide, but the upper 16 bits are not used.

    If a mov ecx, [ebp+arg_0] appeared later in the code, then you would have to revisit your guess about the size of arg_0: it is certainly at least 32 bits.
    It is unlikely to be 64-bit (64-bit types are rare in 32-bit code, so we can make this bet), so we can conclude that it is 32-bit.
    In that case, the first snippet was one of those tricks that uses only part of a variable.

    That's how you reverse engineer the size of a variable: make a guess, verify that it is consistent with the rest of the code, revise it if not, and repeat.
    With time, most of your guesses will be good ones that need no revision at all.
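
The guess-and-revise loop can be caricatured in a few lines. This is a deliberately naive heuristic, assuming the widest access seen so far is the best current guess; the function name infer_widths and the list-of-tuples input format are inventions for the example.

```python
def infer_widths(accesses):
    """Map each location to the widest access seen (a lower bound on its size)."""
    widths = {}
    for location, bits in accesses:
        widths[location] = max(bits, widths.get(location, 0))
    return widths


# First snippet only: both arg_0 and var_8 look like 16-bit values.
observed = [("arg_0", 16), ("var_8", 16)]
print(infer_widths(observed))

# A later 32-bit load of arg_0 forces the guess to be revised.
observed.append(("arg_0", 32))
print(infer_widths(observed))
```

A real analysis also has to weigh padding, partial accesses and the platform's data model, so the result is a starting guess to check against the rest of the code and not a verdict.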