I'm trying to implement an algorithm in assembly (MASM64, Windows, x64) using jump tables. Basic idea is: there are 3 different types of operations I need to do with data. The operations depend on some variables, but I found it tedious to implement a lot of switching and many long implementations.
PUBLIC superFunc@@40 ;__vectorcall decoration
.DATA
ALIGN 16
jumpTable1 qword func_11, func_12, func_13, func_14
jumpTable2 qword func_21, func_22, func_23, func_24
jumpTable3 qword func_31, func_32, func_33, func_34
.CODE
superFunc@@40 PROC
;no stack actions, as we should do our stuff as a leaf function
;assume the first parameter (rcx) is our jumpTable index, and it's
;the same index for all functions
mov rax, qword ptr [rcx*8 + offset jumpTable1]
mov r10, qword ptr [rcx*8 + offset jumpTable2]
mov r11, qword ptr [rcx*8 + offset jumpTable3]
jmp qword ptr [rax]
superFunc@@40 ENDP
func_11:
[...] do something with data
jmp qword ptr [r10]
func_12: ; shorted, simply does something else to the data and jumps thru r10
[...]
func_21:
[...] do something with data
jmp qword ptr [r11]
func_22: ; shorted, simply does something else to the data and jumps thru r11
[...]
func_31:
[...] do something with data
ret
func_32: ; shorted, simply does something else to the data and returns
END
Now this compiles well, but it doesn't link with my main C++ Plugin (a DLL), giving me the following linker errors:
LINK : warning LNK4075: ignoring '/LARGEADDRESSAWARE:NO' due to '/DLL' specification
error LNK2017: 'ADDR32' relocation to 'jumpTable1' invalid without /LARGEADDRESSAWARE:NO
How can I implement something like this correctly? Maybe better phrased: How do I implement jump tables and jumping/calling to addresses from those tables correctly in MASM64?
P.S.: I could set up a function table in C++ and tell the superFunc about it via a parameter. That would be what I will do if I don't find a better solution.
RIP-relative addressing only works when there are no other registers in the addressing mode.
[table + rcx*8]
can only be encoded in x86-64 machine code as [disp32 + rcx*8]
, and thus only works with non-large addresses that fit in a 32-bit signed absolute address. Windows can apparently support this with LARGEADDRESSAWARE:NO
, like on Linux compiling with -no-pie
to solve the same problem.
MacOS has no workaround for it, you can't use 64-bit absolute addressing at all there. Mach-O 64-bit format does not support 32-bit absolute addresses. NASM Accessing Array shows how to index a static array using a RIP-relative lea
to get the table address into a register while avoiding 32-bit absolute addresses.
Your jump tables themselves are fine: they use 64-bit absolute addresses which can be relocated anywhere in virtual address space. (Using load-time fixups after ASLR.)
I think you have one too many levels of indirection. Since you already load a function pointer into a register, you should be using jmp r10
not jmp [r10]
. Doing all the loads into registers up front gets them in the pipeline sooner, before any possible branch mispredicts, so is maybe a good idea if you have lots of registers to spare.
Much better would be inlining some of the later blocks, if they're small, because the blocks reachable by any given RCX value aren't reachable any other way. So it would be much better to inline all of func_21
and func_31
into func_11
, and so on for func_12
. You might use assembler macros to make this easier.
Actually what matters is just that the jump at the end of func_11
always goes to func_21
. It's fine of there are other ways to reach that block, e.g. from other indirect branches that skip table 1. That's no reason for func_11
not to fall into it; it only limits what optimizations you can make between those 2 blocks if func_21
still has to be a valid entry point for execution paths that didn't fall through from func_11
.
But anyway, you can implement your code like this. If you do optimize it, you can remove the later dispatching steps and the corresponding loads.
I think this is valid MASM syntax. If not, it should still be clear what the desired machine-code is.
lea rax, [jumpTable1] ; RIP-relative by default in MASM, like GAS [RIP + jumpTable1] or NASM [rel jumpTable1]
; The other tables are at assemble-time-constant small offsets from RAX
mov r10, [rax + rcx*8 + jumpTable3 - jumpTable1]
mov r11, [rax + rcx*8 + jumpTable2 - jumpTable1]
jmp [rax + rcx*8]
func_11:
...
jmp r10 ; TODO: inline func_21 or at least use jmp func_21
; you can use macros to help with either of those
Or if you only want to tie up a single register for one table, maybe use:
lea r10, [jumpTable1] ; RIP-relative LEA
lea r10, [r10 + rcx*8] ; address of the function pointer we want
jmp [r10]
align 8
func_11:
...
jmp [r10 + jumpTable2 - jumpTable1] ; same index in another table
align 8
func_12:
...
jmp [r10 + jumpTable3 - jumpTable1] ; same index in *another* table
This takes full advantage of the known static offsets between tables.
Cache locality for the jump targets
In your matrix of jump targets, any single usage strides down a "column" to follow some chain of jumps. It would obviously be better to transpose your layout so that one chain of jumps goes along a "row", so so all the targets come from the same cache line.
i.e. arrange your table so func_11
and 21
can end with jmp [r10+8]
, and then jmp [r10+16]
, instead of + some offset between tables, for improved spatial locality. L1d load latency is only a few cycles so there's not much extra delay for the CPU in check the correctness of branch prediction, vs. if you'd loaded into registers ahead of the first indirect branch. (I'm considering the case where the first branch mispredicts, so OoO exec can't "see" the memory-indirect jmp until after the correct path for that starts to issue.)
You can also store 32-bit (or 16 or 8-bit) offsets relative to some reference address that's near the jump targets, or relative to the table itself.
For example, look at what GCC does when compiling switch
jump tables in position-independent code, even for targets that do allow runtime fixups of absolute addresses.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 includes a testcase; see it on Godbolt with GCC's MASM-style .intel_syntax
. It uses a movsxd
load from the table, then add rax, rdx
/ jmp rax
. The table entries are things like dd L27 - L4
and dd L25 - L4
(where those are label names, giving the distance from a jump target to the "anchor" L4).
(Also related for that case https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585).