
How to generate the machine code of Thumb instructions?


I searched Google for how to generate the machine code of ARM instructions and found questions such as "Converting very simple ARM instructions to binary/hex".

The answer referenced ARM7TDMI-S Data Sheet (ARM DDI 0084D). The diagram of data processing instructions is good enough. Unfortunately, it's for ARM instructions, not for Thumb/Thumb-2 instructions.

Take the B instruction as an example. ARM Architecture Reference Manual - ARMv7-A and ARMv7-R edition section A8.8.18, Encoding T4:

[Figure: B instruction, encoding T4]

For the assembly code:

B 0x50

How can I encode the immediate value 0x50 into the 4-byte machine code? Or, if I want to write a C function that takes the B instruction and the relative address as inputs and returns the encoded machine code, how can I implement such a function?

unsigned int gen_mach_code(int instruction, int relative_addr)
{
    /* the int instruction parameter is assumed to be B */
    /* encoding method is assumed to be T4 */
    unsigned int mach_code;
    /* construct the machine code of B<c>.W <label> */
    return mach_code;
}

I know how immediate values are encoded on ARM; http://alisdair.mcdiarmid.org/arm-immediate-value-encoding/ is a good tutorial.

I just want to know where imm10 and imm11 come from, and how to construct the full machine code with them.


Solution

  • First and foremost, the ARM7TDMI does not support the Thumb-2 extensions; it basically defines the original Thumb instruction set.

    So why not just try it?

    .thumb
    @.syntax unified
    
    b 0x50
    

    Run these commands:

    arm-whatever-whatever-as b.s -o b.o
    arm-whatever-whatever-objdump -D b.o
    

    and you get this output:

    0:  e7fe        b.n 50 <*ABS*0x50>
    

    That is the T2 encoding, which the newer docs list as supported on ARMv4T, ARMv5T*, ARMv6* and ARMv7. The ARM7TDMI is an ARMv4T core.

    The top bits of 0xE7FE match the 11100 that starts that instruction definition, so imm11 is 0x7FE, which is basically an encoding of a branch to address 0x000, since this isn't linked with anything. How do I know that?

    .thumb
    b skip
    nop
    nop
    nop
    nop
    nop
    skip:
    
    00000000 <skip-0xc>:
       0:   e004        b.n c <skip>
       2:   46c0        nop         ; (mov r8, r8)
       4:   46c0        nop         ; (mov r8, r8)
       6:   46c0        nop         ; (mov r8, r8)
       8:   46c0        nop         ; (mov r8, r8)
       a:   46c0        nop         ; (mov r8, r8)
    

    0xE004 starts with 11100, so that is a branch with encoding T2, and imm11 is 4.

    We need to reach from 0 to 0xC. The PC is two instructions ahead when the offset is applied. The docs say

    Encoding T2 Even numbers in the range –2048 to 2046
    

    and

    PC, the program counter
    - When executing an ARM instruction, PC reads as the address of the current instruction plus 8.
    - When executing a Thumb instruction, PC reads as the address of the current instruction plus 4.
    

    So that all makes sense: 0xC - 0x4 = 8. We can only do even offsets, and it makes no sense to branch into the middle of an instruction anyway, so divide by 2 because Thumb instructions are two bytes (the offset is counted in two-byte units, not bytes). That gives 4:

    0xE004
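
    The arithmetic above can be sketched in C. This is only an illustration under the assumptions stated so far (PC reads as the instruction address plus 4, the offset is stored in imm11 in two-byte units); the function name encode_b_t2 and its error convention are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode a 16-bit Thumb "B <label>" (encoding T2).
 * Stores the halfword in *out and returns 0, or returns -1 if the target
 * is at an odd offset or outside the -2048..+2046 byte range. */
int encode_b_t2(uint32_t instr_addr, uint32_t target, uint16_t *out)
{
    /* In Thumb state the PC reads as the instruction address plus 4. */
    int32_t offset = (int32_t)(target - (instr_addr + 4u));

    if ((offset & 1) != 0 || offset < -2048 || offset > 2046)
        return -1;

    /* imm11 is the offset in two-byte units, as 11-bit two's complement. */
    uint16_t imm11 = (uint16_t)(((uint32_t)offset >> 1) & 0x7FFu);
    assert(imm11 <= 0x7FFu);

    *out = (uint16_t)(0xE000u | imm11);   /* 11100 : imm11 */
    return 0;
}
```

    Called with instr_addr 0 and target 0xC, it produces the 0xE004 seen in the disassembly.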
    

    Here is one way to generate the T4 encoding:

    .thumb
    .syntax unified
    
    b skip
    nop
    nop
    nop
    nop
    nop
    skip:
    
    00000000 <skip-0xe>:
       0:   f000 b805   b.w e <skip>
       4:   46c0        nop         ; (mov r8, r8)
       6:   46c0        nop         ; (mov r8, r8)
       8:   46c0        nop         ; (mov r8, r8)
       a:   46c0        nop         ; (mov r8, r8)
       c:   46c0        nop         ; (mov r8, r8)
    

    The T4 encoding of branch has 11110 at the top of the first halfword, indicating this is either an undefined instruction (on anything that is not ARMv6T2 or ARMv7) or a Thumb-2 extension (on ARMv6T2 and ARMv7).

    The second halfword has to match 10x1 in its top bits, and we see a B (1011), so this looks good: it is a Thumb-2 extended branch.

    S is 0, imm10 is 0, J1 is 1, J2 is 1, and imm11 is 5.

    I1 = NOT(J1 EOR S); I2 = NOT(J2 EOR S); imm32 = SignExtend(S:I1:I2:imm10:imm11:’0’, 32);
    

    1 EOR 0 is 1, right? NOT that and you get 0. So I1 and I2 are both 0, S is 0 and imm10 is 0, so in this case we are basically only looking at imm11 as a positive number.

    The PC is four bytes ahead when executing, so 0xE - 0x4 = 0xA.

    0xA / 2 = 0x5, and that is our branch offset: PC + (5*2).

    .syntax unified
    .thumb
    
    
    b.w skip
    nop
    here:
    nop
    nop
    nop
    nop
    skip:
    b.w here
    
    00000000 <here-0x6>:
       0:   f000 b805   b.w e <skip>
       4:   46c0        nop         ; (mov r8, r8)
    
    00000006 <here>:
       6:   46c0        nop         ; (mov r8, r8)
       8:   46c0        nop         ; (mov r8, r8)
       a:   46c0        nop         ; (mov r8, r8)
       c:   46c0        nop         ; (mov r8, r8)
    
    0000000e <skip>:
       e:   f7ff bffa   b.w 6 <here>
    

    S is 1, imm10 is 0x3FF, J1 is 1, J2 is 1, and imm11 is 0x7FA.

    1 EOR 1 is 0; NOT that and you get 1 for I1, and the same for I2.

    imm32 = SignExtend(S:I1:I2:imm10:imm11:’0’, 32);
    

    S is 1, so this sign-extends with ones, and all but the last few bits are ones. S:I1:I2:imm10:imm11 sign-extended is 0xFFFFFFFA, or -6 two-byte units, so with the trailing 0 appended imm32 is 0xFFFFFFF4, or -12 bytes.

    So the distance is ((0xE + 4) - 6)/2 = 6 two-byte units as well. Or look at it another way, from the instruction encoding: PC - (6*2) = (0xE + 4) - 12 = 6, a branch to 0x6.
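
    Both decodes above follow the same steps, so they can be sketched as one C function. This is an illustration only: the names decode_b_t4_offset and b_t4_target are made up here, and the code simply applies the I1/I2/SignExtend pseudocode quoted from the manual.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: recover imm32 (a byte offset) from the two
 * halfwords of a Thumb-2 "B.W" (encoding T4). */
int32_t decode_b_t4_offset(uint16_t hw1, uint16_t hw2)
{
    /* hw1 = 11110 S imm10, hw2 = 10 J1 1 J2 imm11 */
    assert((hw1 & 0xF800u) == 0xF000u && (hw2 & 0xD000u) == 0x9000u);

    uint32_t s     = (hw1 >> 10) & 1u;
    uint32_t imm10 = hw1 & 0x3FFu;
    uint32_t j1    = (hw2 >> 13) & 1u;
    uint32_t j2    = (hw2 >> 11) & 1u;
    uint32_t imm11 = hw2 & 0x7FFu;

    uint32_t i1 = (j1 ^ s) ^ 1u;   /* I1 = NOT(J1 EOR S) */
    uint32_t i2 = (j2 ^ s) ^ 1u;   /* I2 = NOT(J2 EOR S) */

    /* S:I1:I2:imm10:imm11:'0' is a 25-bit value. */
    uint32_t imm = (s << 24) | (i1 << 23) | (i2 << 22) |
                   (imm10 << 12) | (imm11 << 1);

    if (s)
        imm |= 0xFE000000u;        /* sign-extend from bit 24 */
    return (int32_t)imm;
}

/* Branch target = PC (instruction address + 4) + imm32. */
uint32_t b_t4_target(uint32_t instr_addr, uint16_t hw1, uint16_t hw2)
{
    return instr_addr + 4u + (uint32_t)decode_b_t4_offset(hw1, hw2);
}
```

    For f000 b805 at address 0 this gives an offset of 10 and a target of 0xE; for f7ff bffa at address 0xE it gives -12 and a target of 0x6, matching the disassembly.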

    So if you wanted to branch to, say, 0x70 and the address of the instruction is 0x12, then your offset is 0x70 - (0x12 + 4) = 0x5A bytes, or 0x2D two-byte units. We know from the skip example that for a small forward offset the trick is to make S 0 and J1 and J2 both 1:

    0x12: 0xF000 0xB82D  branch to 0x70
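
    Going the other way, the gen_mach_code function from the question could look roughly like the sketch below. It is an assumption-laden illustration, not the assembler's actual implementation: the name encode_b_t4, the choice to take absolute addresses, and the packing of the first halfword into the upper 16 bits of the result are all choices made for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode "B.W <label>" (encoding T4).
 * On success returns 0 and stores the first halfword in bits 31:16 of
 * *out and the second halfword in bits 15:0. Returns -1 if the target
 * is at an odd offset or out of range. */
int encode_b_t4(uint32_t instr_addr, uint32_t target, uint32_t *out)
{
    /* In Thumb state the PC reads as the instruction address plus 4. */
    int32_t offset = (int32_t)(target - (instr_addr + 4u));

    /* T4 reaches even offsets from -16777216 to +16777214 bytes. */
    if ((offset & 1) != 0 || offset < -16777216 || offset > 16777214)
        return -1;

    /* Split the offset into S:I1:I2:imm10:imm11:'0'. */
    uint32_t u     = (uint32_t)offset;
    uint32_t s     = (u >> 24) & 1u;
    uint32_t i1    = (u >> 23) & 1u;
    uint32_t i2    = (u >> 22) & 1u;
    uint32_t imm10 = (u >> 12) & 0x3FFu;
    uint32_t imm11 = (u >> 1) & 0x7FFu;

    /* Invert I1 = NOT(J1 EOR S) to get J1 = NOT(I1) EOR S. */
    uint32_t j1 = (i1 ^ 1u) ^ s;
    uint32_t j2 = (i2 ^ 1u) ^ s;

    uint32_t hw1 = 0xF000u | (s << 10) | imm10;                 /* 11110 S imm10 */
    uint32_t hw2 = 0x9000u | (j1 << 13) | (j2 << 11) | imm11;   /* 10 J1 1 J2 imm11 */
    assert(hw1 <= 0xFFFFu && hw2 <= 0xFFFFu);

    *out = (hw1 << 16) | hw2;
    return 0;
}
```

    With instr_addr 0 and target 0xE it returns 0xF000B805, and with instr_addr 0xE and target 0x6 it returns 0xF7FFBFFA, the same halfword pairs objdump printed earlier. When writing the result to memory, the first halfword goes at the lower address.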
    

    so now knowing that we can go back to this:

    0:  e7fe        b.n 50 <*ABS*0x50>
    

    The offset is a sign-extended 0x7FE, or 0xFFFFFFFE. 0xFFFFFFFE*2 + 4 = 0xFFFFFFFC + 4 = 0x00000000: a branch to 0.
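
    The same check can be done in code. A minimal sketch, with a made-up function name b_t2_target, that sign-extends imm11:'0' and adds it to the PC value (instruction address plus 4):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute the target of a 16-bit Thumb "B"
 * (encoding T2) located at instr_addr. */
uint32_t b_t2_target(uint32_t instr_addr, uint16_t hw)
{
    assert((hw & 0xF800u) == 0xE000u);               /* 11100 : imm11 */

    uint32_t imm32 = (uint32_t)(hw & 0x7FFu) << 1;   /* imm11:'0' */
    if (imm32 & 0x800u)
        imm32 |= 0xFFFFF000u;                        /* sign-extend from bit 11 */

    return instr_addr + 4u + imm32;                  /* PC + imm32 */
}
```

    For 0xE7FE the offset is -4, which cancels the PC's +4, so the result is always the address of the branch itself (0 when the instruction sits at address 0). For 0xE004 at address 0 the result is 0xC.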

    add a nop

    .thumb
    nop
    b 0x50
    
    00000000 <.text>:
       0:   46c0        nop         ; (mov r8, r8)
       2:   e7fe        b.n 50 <*ABS*0x50>
    

    Same encoding.

    So the disassembly implies an absolute value of 0x50, but the instruction does not encode it. Linking doesn't help; it just complains:

    (.text+0x0): relocation truncated to fit: R_ARM_THM_JUMP11 against `*ABS*0x50'
    

    this

    .thumb
    nop
    b 0x51
    

    gives the same encoding.

    So basically there is something wrong with this syntax, and/or it is looking for a label named 0x50, perhaps?

    I hope your example meant that you want to know the encoding of a branch to some address, rather than that exact syntax.

    ARM is not like some other instruction sets: these branches are always PC-relative. If you can reach the destination with the encoding then you get a branch; otherwise you have to use a bx or pop or one of the other ways to modify the PC (with an absolute value).

    Knowing that the T2 encoding from the docs can only reach about 2048 bytes ahead, put more than 2048 nops between the branch and its destination and the assembler refuses:

    b.s: Assembler messages:
    b.s:5: Error: branch out of range
    

    Maybe this is what you are looking to do?

    .thumb
    mov r0,#0x51
    bx r0
    
    00000000 <.text>:
       0:   2051        movs    r0, #81 ; 0x51
       2:   4700        bx  r0
    

    That branches to absolute address 0x50 (r0 holds 0x51 because bx uses bit 0 of the register to stay in Thumb state). For that specific address there is no need for Thumb-2 extensions.

    .thumb
    ldr r0,=0x12345679
    bx r0
    
    00000000 <.text>:
       0:   4800        ldr r0, [pc, #0]    ; (4 <.text+0x4>)
       2:   4700        bx  r0
       4:   12345679    eorsne  r5, r4, #126877696  ; 0x7900000
    

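    That branches to address 0x12345678, or any other possible address. The word at offset 4 is the literal that ldr loads, which objdump -D disassembles as if it were an instruction.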