I am trying to use a table with short jump offsets:
mov $4, %eax
j1:
movzbl offset(%eax),%edx # load jump offset
jmp *(%edx)
r1:
...
offset:
.byte 0, 1, 2, 3, 4 # Example values
Objdump shows the jump encoded as ff 22, which is not a short jump.
I also tried jmp *r1(%edx) to jump to label r1 + an offset, based on what I saw in this question: On x86 assembly jump table, but gdb shows that it takes me somewhere completely different in memory.
Another idea is to read eip and add an offset manually, as shown in this answer:
call get_eip
get_eip:
pop %eax
add %edx, %eax
Ideally the solution is as short as possible, in the interest of code golf. So how can I specify a jump table to nearby sections of code while using only 1 byte per offset?
x86 doesn't have relative indirect jumps. You always have to compute (or load) the absolute target address.
jmp *(%edx) uses %edx as a pointer, and loads a new EIP value from the 32-bit location pointed to by %edx, i.e. it's a memory-indirect jump.
So is jmp *r1(%edx). The code in the question you linked is jmp *operations(,%ecx,4), which loads a 32-bit target address from a table of pointers. (That's why it scales the index by 4.) If EIP were exposed as a general-purpose register, your jmp *r1(%edx) would be mov r1(%edx), %eip, so it's unsurprising that using 4 bytes of instructions as a pointer is not useful.
To compute a target address, you probably want to use a register-indirect jump, like jmp *%eax. That sets EIP to the value of EAX, so the only memory access will be instruction fetch from the new address.
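To make the difference concrete, here is a sketch of the two forms side by side with their 32-bit encodings (both are opcode FF /4; only the ModRM byte differs). Neither is a short jump: the rel8 encoding exists only for direct jumps.

```asm
jmp *(%edx)    # ff 22: memory-indirect, loads EIP from the dword at address EDX
jmp *%edx      # ff e2: register-indirect, EIP = EDX
jmp *%eax      # ff e0: register-indirect, EIP = EAX
```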
You're obviously using 32-bit mode, so you can't use a RIP-relative LEA for position-independent code. But if you can make your code position-dependent, you can use the address of a label as an immediate. You're using position-dependent addressing for offset(%eax) already (32-bit absolute address as a disp32), so you might as well do that.
.section .rodata
jump_offset: .byte 0, .L2-.L1, .L3-.L1, ...
.section .text
# selector in EAX
movzbl jump_offset(%eax), %eax
add $.L1, %eax
jmp *%eax # EIP = EAX
# put the most common label first: when no branch-target prediction is available,
# the default prediction for an indirect jmp is fall-through.
.L1:
...
.L2:
...
.L3:
...
If each block is the same size (or you can pad it to the same size), you don't need a table at all; you can just scale the selector:
# selector in EAX
lea .L1(,%eax,8), %eax # or shift or multiply + add for other sizes
jmp *%eax
.p2align 3 # ideally arrange for this to be 0 bytes, by lengthening earlier instructions or padding earlier
.L1: ...
.p2align 3 # pad to a multiple of 8
.L2: ...
.p2align 3
.L3: ...
It doesn't have to be a power-of-2 block size: lea .L1(%eax,%eax,8), %eax to scale by 9 and add the base is probably better than wasting 7 bytes per block. But it means you can't use .p2align anymore to help you make each block the same size. (I think GAS might be able to calculate padding the way NASM can, where times 9-($-.L1) nop inserts enough padding bytes to reach 9 bytes beyond .L1. But single-byte NOPs suck if there's more than one and they're executed. Anyway, I don't remember the GAS syntax.)
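For what it's worth, one GAS candidate for that padding is the .org directive, which advances the location counter to a given address within the current section and takes an optional fill byte. A minimal sketch, assuming 9-byte blocks and a selector of 0..2 in EAX; the block bodies are placeholders, not anything from the question:

```asm
        # selector (0..2) in EAX
        lea   .L1(%eax,%eax,8), %eax   # EAX = .L1 + selector*9
        jmp   *%eax
.L1:
        mov   $1, %edx                 # placeholder block body
        ret
        .org  .L1 + 9, 0x90            # fill with 0x90 (NOP) bytes up to .L1+9
.L2:
        mov   $2, %edx                 # placeholder block body
        ret
        .org  .L1 + 18, 0x90           # fill up to .L1+18
.L3:
        mov   $3, %edx                 # placeholder block body
        ret
```

Because each placeholder block ends in ret, the fill bytes are never executed. If a block grows past its slot, .org would have to move the location counter backwards, which the assembler rejects, so an overlong block shows up as an assembly error instead of a silently wrong jump target.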
In 64-bit PIC code, compute the target with lea .L1(%rip), %rdx / add %rax, %rdx.
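Combined with a byte table, a 64-bit version might look like the sketch below. It assumes the selector is already zero-extended into RAX, and it reuses the jump_offset table and the .L1 base label from the 32-bit example above.

```asm
        # selector in EAX, zero-extended into RAX
        lea     jump_offset(%rip), %rdx    # RIP-relative address of the byte table
        movzbl  (%rdx,%rax), %eax          # load the 1-byte offset; writing EAX zero-extends into RAX
        lea     .L1(%rip), %rdx            # RIP-relative address of the base label
        add     %rax, %rdx
        jmp     *%rdx                      # RIP = .L1 + offset
```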
In 32-bit PIC code, use
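call .LPIC_reference_point
.LPIC_reference_point:
pop %edx # EDX = run-time address of .LPIC_reference_point
# table entries must be relative to the reference point: .byte .L1 - .LPIC_reference_point, ...
movzbl jump_offsets - .LPIC_reference_point(%edx,%eax), %eax # EDX base makes the table access position-independent
add %edx, %eax
jmp *%eax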
Or use the GOT for PIC access to static data the way compilers do (look at gcc -O3 -m32 -fPIE output).
(call +0 does not unbalance the return-address predictor stack on Intel P6 or SnB-family, or AMD K8/Bulldozer, so call/pop is safe to use. Henry doesn't have tests on Silvermont, though, and it does cause mispredicts on Nano3000.)