Short jump offset table usage

I am trying to use a table with short jump offsets:

        mov     $4, %eax           

j1: 
        movzbl  offset(%eax),%edx   # load jump offset 
        jmp     *(%edx)

r1:
        ...


offset:
        .byte   0, 1, 2, 3, 4       # Example values

Objdump shows the jump encoded as ff 22 which is not a short jump.

I also tried jmp *r1(%edx) to jump to label r1 plus an offset, based on what I saw in the question "On x86 assembly jump table", but gdb shows that it takes me somewhere completely different in memory.

Another idea is to read EIP and add an offset manually, as shown in this answer:

    call get_eip
get_eip:
    pop %eax
    add %edx, %eax

Ideally the solution is as short as possible, in the interest of code golf. So how can I specify a jump table to nearby sections of code while only using 1 byte per offset?

Solution

x86 doesn't have relative indirect jumps. You always have to compute (or load) the absolute target address.

jmp *(%edx) uses %edx as a pointer, and loads a new EIP value from the 32-bit location pointed to by %edx. i.e. it's a memory-indirect jump.

So is jmp *r1(%edx). The code in the question you linked is jmp *operations(,%ecx,4), which loads a 32-bit target address from a table of pointers. (That's why it scales the index by 4.) If EIP were exposed as a general-purpose register, that jmp would be mov r1(%edx), %eip, so it's unsurprising that using 4 bytes of instructions as a pointer is not useful.

To compute a target address, you probably want to use a register-indirect jump, like jmp *%eax. That sets EIP to the value of EAX, so the only memory access will be instruction fetch from the new address.
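To make the difference concrete, here is a side-by-side sketch of the forms discussed so far, with the opcode bytes objdump would show in 32-bit mode. r1 and operations are the labels from the question and the linked question; nothing here is meant to be run as-is.

    jmp     *%eax                   # ff e0            register-indirect: EIP = EAX
    jmp     *(%edx)                 # ff 22            memory-indirect: EIP = 4 bytes loaded from [EDX]
    jmp     *r1(%edx)               # ff a2 + disp32   memory-indirect: EIP = 4 bytes loaded from [r1 + EDX]
    jmp     *operations(,%ecx,4)    # ff 24 8d + disp32  EIP = entry ECX of a table of 4-byte pointers

Only the first form uses a register value directly as the new EIP; the other three all load the target from memory.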

You're obviously using 32-bit mode, so you can't use a RIP-relative LEA for position-independent code. But if you can make your code position-dependent, you can use the address of a label as an immediate. You're using position-dependent addressing for offset(%eax) already (32-bit absolute address as a disp32), so you might as well do that.

.section .rodata
    jump_offset: .byte 0, .L2-.L1,  .L3-.L1,  ...

.section .text
    # selector in EAX
    movzbl  jump_offset(%eax), %eax
    add     $.L1, %eax
    jmp     *%eax                # EIP = EAX
    # put the most common label first: when no branch-target prediction is available,
    # the default prediction for an indirect jmp is fall-through.
.L1:
    ...

.L2:
  ...

.L3:     
  ...
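As a fuller sketch of the table approach above, here is a hypothetical complete function. The name dispatch, the .Lcase labels, the return values, and the assumption that the caller passes a selector in the range 0..2 on the stack are all made up for illustration. It also assumes a position-dependent 32-bit build (for example gcc -m32 -no-pie), because it uses a label address as an immediate.

    # Hypothetical: int dispatch(unsigned selector), selector assumed to be 0..2
    .section .rodata
jump_offset:
    .byte   0, .Lcase1 - .Lcase0, .Lcase2 - .Lcase0   # 1 byte per entry

    .text
    .globl  dispatch
dispatch:
    mov     4(%esp), %eax              # selector (not range-checked here)
    movzbl  jump_offset(%eax), %eax    # load 1-byte offset, zero-extended
    add     $.Lcase0, %eax             # absolute target = base label + offset
    jmp     *%eax                      # EIP = EAX
.Lcase0:
    mov     $10, %eax
    ret
.Lcase1:
    mov     $20, %eax
    ret
.Lcase2:
    mov     $30, %eax
    ret

Because each entry is an unsigned byte, every target has to lie within 255 bytes of the base label; the assembler should reject a difference that doesn't fit in a .byte.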

If each block is the same size (or you can pad it to the same size), you don't need a table at all; you can just scale the selector:

    # selector in EAX
    lea     .L1(,%eax,8), %eax  # or shift or multiply + add for other sizes
    jmp     *%eax

.p2align 3     # ideally arrange for this to be 0 bytes, by lengthening earlier instructions or padding earlier
.L1: ...

.p2align 3     # pad to a multiple of 8
.L2: ...

.p2align 3
.L3: ...

It doesn't have to be a power-of-2 block size: lea .L1(%eax,%eax,8), %eax to scale by 9 and add the base is probably better than wasting 7 bytes per block. But it means you can't use .p2align anymore to help you make each block the same size. (I think GAS might be able to calculate padding the way NASM can: times 9-($-.L1) nop inserts enough padding bytes to reach 9 bytes beyond .L1. But single-byte NOPs suck if there's more than 1 and they're executed. Anyway, I don't remember the GAS syntax.)
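For the padding question, one candidate in GAS is the .org directive, which advances the location counter to a given expression in the current section and takes an optional fill byte. The sketch below assumes 9-byte blocks; the block contents are placeholders for illustration, and 0x90 is the single-byte NOP.

    # selector in EAX
    lea     .L1(%eax,%eax,8), %eax   # .L1 + 9 * selector
    jmp     *%eax

.L1:
    mov     $10, %eax                # 5 bytes
    ret                              # 1 byte
    .org    .L1 + 9, 0x90            # pad with 0x90 up to .L1 + 9
.L2:
    mov     $20, %eax
    ret
    .org    .L1 + 18, 0x90           # pad up to .L1 + 18
.L3:
    mov     $30, %eax
    ret

If a block grows past its slot, .org would have to move backwards and the assembler reports an error instead of silently misaligning later blocks. In this sketch every block ends in ret, so the padding bytes are never executed.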

In 64-bit PIC code, get the base address with a RIP-relative LEA and add the offset: lea .L1(%rip), %rdx / add %rax, %rdx, then jmp *%rdx.
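Spelled out, the 64-bit sequence might look like the sketch below. It assumes the same jump_offset table as the 32-bit version (offsets relative to .L1) and a selector already zero-extended into RAX; the table also has to be addressed RIP-relative, since a 32-bit absolute address isn't available in PIC code.

    # selector in RAX (assumed zero-extended and in range)
    lea     jump_offset(%rip), %rdx    # table address, RIP-relative
    movzbl  (%rdx,%rax), %eax          # 1-byte offset; writing EAX zero-extends into RAX
    lea     .L1(%rip), %rdx            # base label address
    add     %rax, %rdx                 # target = base + offset
    jmp     *%rdx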

In 32-bit PIC code, use

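    call .LPIC_reference_point
.LPIC_reference_point:
    pop   %edx                  # EDX = run-time address of .LPIC_reference_point
    # jump_offsets: table of .byte entries, each (target - .LPIC_reference_point)
    movzbl jump_offsets - .LPIC_reference_point(%edx,%eax), %eax   # selector in EAX
    add   %edx, %eax            # absolute target = reference point + offset
    jmp   *%eax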

Or use the GOT for PIC access to static data the way compilers do (look at gcc -O3 -m32 -fPIE output.)

(call +0 does not unbalance the return-address predictor stack on Intel P6 or SnB-family, or AMD K8/Bulldozer, so call/pop is safe to use. Henry doesn't have test results for Silvermont, though, and it does cause mispredicts on Nano 3000.)