In the following assembly:
mov dx, word ptr [ebp+arg_0]
mov [ebp+var_8], dx
Thinking of this as an assembled C function, how many bits wide is (the argument to the C function) arg_0? How many bits wide is (the local C variable) var_8? That is to say, is it a short, an int, etc.
From this, it appears that var_8 is 16 bits, since dx is a 16-bit register. But I'm not sure about arg_0.
If the assembly also contains this line:
mov ecx, [ebp+arg_0]
Would that imply that arg_0 is a 32-bit value?
There are three principles to understand in order to tackle this question.
The assembler must be able to infer the correct operand size.
Though Intel syntax does not use a size suffix the way AT&T syntax does, the assembler still needs a way to determine the size of the operands.
The ambiguous instruction mov [var], 1 is written as movl $1, var in AT&T syntax if the size of the store is 32-bit (note the suffix l), so it is easy to tell the size of the immediate operand.
An assembler that accepts Intel syntax needs another way to infer this size; there are four widely used options:
1. The size of a register operand.
mov [var], dx
is a 16-bit store, because dx is a 16-bit register.
2. An explicit size specifier.
mov WORD [var], dx
in NASM syntax, or mov WORD PTR [var], dx in MASM syntax. NASM-family assemblers don't need PTR after the size, because their size specifiers are only allowed on memory operands, not immediates or anywhere else. The specifier cannot contradict a register operand (mov WORD [var], edx is invalid).
3. It is inferred from the context.
var db 0
mov [var], 1 ; MASM/TASM only. associate sizes with labels
MASM-syntax assemblers can infer that since var is declared with db, its size is 8-bit and so is the store (by default).
This is the form I don't like, because it makes the code harder to read (one good thing about assembly is the "locality" of the semantics of the instructions) and mixes high-level concepts like types with low-level concepts like store sizes. That's why NASM's syntax doesn't support magical / non-local size association.
4. A default size determined by the code size or memory model.
This covers push, branches and all the instructions whose operand size depends on the memory model or code size (push word 123 vs. push 123). A short snippet combining these forms follows below.
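To make these options concrete, here is a minimal NASM-syntax sketch (the labels and values are made up for illustration; the MASM spelling of the explicit form is shown in a comment):

bits 32
section .data
var  db 0                   ; an 8-bit variable
wvar dw 0                   ; a 16-bit variable

section .text
examples:
    mov [wvar], dx          ; size inferred from the register operand: 16-bit store
    mov word [wvar], 1      ; explicit size specifier for an immediate store
                            ; (MASM: mov WORD PTR [wvar], 1)
    ; mov [var], 1          ; rejected by NASM: operation size not specified
    mov byte [var], 1       ; NASM requires the size to be spelled out here
    push word 123           ; overrides the default operand size of push
    push 123                ; default size: a 32-bit push in 32-bit code
    ret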
In short, there must be a way for the assembler to tell the size, otherwise it will reject the code. (Some low-quality assemblers, like emu8086, instead fall back on a default operand size for ambiguous cases.)
If you are looking at disassembled code, the disassembler usually takes the safe side and always states the size explicitly.
If it doesn't, you must resort to manual inspection of the opcode bytes; if the disassembler won't show them, it is time to change disassembler.
The disassembler itself has no trouble determining the operand size: the binary code it is disassembling is the same code the CPU executes, and the instruction encoding specifies the operand size.
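For example (a small sketch, assuming 32-bit code; the byte sequences are the standard encodings of these instructions):

bits 32
sizes:
    mov [ebp-8], dx         ; 66 89 55 F8  - the 66h operand-size prefix marks a 16-bit store
    mov [ebp-8], edx        ; 89 55 F8     - same opcode without the prefix: a 32-bit store
    mov byte [ebp-8], 1     ; C6 45 F8 01  - a different opcode (C6 /0) for an 8-bit immediate store
    ret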
The C language is intentionally loose about how C types map to a number of bits
It's not futile to try to infer the type of a variable from the disassembly, but one must also take the platform into consideration, not only the architecture.
The main models used are discussed here:
Datatype    LP64   ILP64   LLP64   ILP32   LP32
char          8       8       8       8      8
short        16      16      16      16     16
_int32               32
int          32      64      32      32     16
long         64      64      32      32     32
long long                    64     [64]
pointer      64      64      64      32     32
Windows on x86-64 uses LLP64. Other OSes on x86-64 typically use the x86-64 System V ABI, an LP64 model.
Assembly doesn't have types and programmers can exploit that
Even compilers can exploit that.
In the linked case, a variable bar of type long long (64-bit) is ORed with 1, and clang saves a REX prefix by ORing only the low byte. This causes a store-forwarding stall if the variable is reloaded right away with two dword loads or one qword load, so it's probably not a good choice, especially in 32-bit mode where or dword [bar], 1 is the same size and the variable is likely to be reloaded as two 32-bit halves.
Someone looking at the disassembled code incautiously could infer that bar is 8-bit.
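A minimal NASM-style sketch of this kind of partial access (the label bar and the surrounding code are assumptions for illustration):

bits 32
section .bss
bar resq 1                  ; a 64-bit (long long) variable

section .text
set_flag:
    or byte [bar], 1        ; touches only the low byte; a careless reader may conclude bar is 8-bit
    ; or dword [bar], 1     ; same instruction length in 32-bit code, friendlier to a later 32-bit reload
    ret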
These kinds of tricks, where a variable or an object is accessed only partially, are common.
Correctly guessing the size of a variable takes a bit of expertise.
For example, structure members are usually padded, so the unused space between them may fool an inexperienced reader into thinking that each member is bigger than it is.
The stack also has alignment requirements that may widen the apparent size of parameters.
The rule of thumb is that compilers generally prefer to keep the stack 16-byte aligned and to naturally align all variables. Multiple narrow variables are packed into a single dword. When passing function arguments on the stack, each one is padded to 32 or 64 bits, but that doesn't apply to the layout of locals on the stack.
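As an illustration, a hypothetical struct { char c; int i; } accessed through a pointer held in eax would typically be read like this (offsets assumed from the usual 4-byte alignment of int):

bits 32
read_members:
    movzx edx, byte [eax]   ; s->c: only one byte is meaningful, bytes 1..3 are padding
    mov   ecx, [eax+4]      ; s->i: starts at offset 4 because of padding, not because c is 4 bytes wide
    ret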
To finally answer your question
Yes, from the first snippet of code you can assume that the value of arg_0 is 16 bits wide.
Note that since it's a function argument passed on the stack, its slot is actually 32 bits wide, but the upper 16 bits are not used.
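A sketch of what that looks like at the call site (cdecl-style stack passing and the frame setup are assumptions; arg_0 corresponds to [ebp+8] here):

bits 32
caller:
    push dword 0x1234       ; the 16-bit value occupies a full 4-byte stack slot
    call callee
    add  esp, 4
    ret
callee:
    push ebp
    mov  ebp, esp
    mov  ax, [ebp+8]        ; like mov dx, word ptr [ebp+arg_0]: only the low 16 bits are read
    pop  ebp
    ret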
If a mov ecx, [ebp+arg_0] appeared later in the code, then you would have to revisit your guess about the size of the value of arg_0: it is certainly at least 32 bits.
It is unlikely to be 64-bit (64-bit types are rare in 32-bit code, so we can make that bet), so we can conclude it is 32-bit.
Evidently, the first snippet was one of those tricks that only uses a part of a variable.
That's how you deal with reverse engineering the size of a variable: you make a guess, verify it is consistent with the rest of the code, revisit it if not, and repeat.
With time you'll make mostly good guesses that need no revision at all.