Tags: python, python-3.x, arm, thumb

ARM Thumb BL instruction loops to itself


I'm trying to assemble this code using Keystone and execute it with the Unicorn engine:

start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start

As I understand it, the bl instruction should save the address of the next instruction in the lr register and then jump to start. That would make this an infinite loop that adds 1 to r0 and 2 to r1 on each pass.

Apparently, I'm wrong, because bl start branches to itself instead!

I'm using Python wrappers for Keystone, Capstone and Unicorn to process the assembly. Here's my code:

import keystone as ks
import capstone as cs
import unicorn as uc

print(f'Keystone {ks.__version__}\nCapstone {cs.__version__}\nUnicorn {uc.__version__}\n')


code = '''
start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start
'''

assembler = ks.Ks(ks.KS_ARCH_ARM, ks.KS_MODE_THUMB)
disassembler = cs.Cs(cs.CS_ARCH_ARM, cs.CS_MODE_THUMB)
emulator = uc.Uc(uc.UC_ARCH_ARM, uc.UC_MODE_THUMB)

machine_code, _ = assembler.asm(code)
machine_code = bytes(machine_code)
print(machine_code.hex())

initial_address = 0
for addr, size, mnem, op_str in disassembler.disasm_lite(machine_code, initial_address):
    instruction = machine_code[addr:addr + size]
    print(f'{addr:04x}|\t{instruction.hex():<8}\t{mnem:<5}\t{op_str}')

emulator.mem_map(initial_address, 1024)  # allocate 1024 bytes of memory
emulator.mem_write(initial_address, machine_code)  # write the machine code
emulator.hook_add(uc.UC_HOOK_CODE, lambda uc, addr, size, _: print(f'Address: {addr}'))
emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)

This is what it outputs:

Keystone 0.9.1
Capstone 5.0.0
Unicorn 1.0.2

00f1010001f10201fff7fefff8e7
0000|   00f10100    add.w   r0, r0, #1
0004|   01f10201    add.w   r1, r1, #2
0008|   fff7feff    bl      #8         ; why not `bl #0`?
000c|   f8e7        b       #0
Address: 0
Address: 4
Address: 8  # OK, we arrived at BL start
Address: 8  # we're at the same instruction again?
Address: 8  # and again?
Address: 8
< ... >
Address: 8
Address: 8
Traceback (most recent call last):
  File "run_ARM_bug.py", line 32, in <module>
    emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/unicorn-1.0.2rc3-py3.7.egg/unicorn/unicorn.py", line 317, in emu_start
unicorn.unicorn.UcError: Emulation timed out (UC_ERR_TIMEOUT)

The exception is not a problem (I set the timeout myself). The problem is that bl start always jumps to itself instead of start.

A forward jump, however, works as expected. In this example bl jumps to the correct address:

start:
    ; stuff
    bl next
    ; hello

next:
    add r0, r0, #1
    bkpt

EDIT

I went on and assembled this code with Clang:

; test.s

.text
.syntax unified
.globl  start       
.p2align    1
.code   16       
.thumb_func
start:
    add r0, r0, #1
    add r1, r1, #2
    bl start
    b start

I used the following command:

$ clang -c test.s -target armv7-unknown-linux -o test.bin -mthumb
clang-11: warning: unknown platform, assuming -mfloat-abi=soft

And then disassembled test.bin with objdump:

$ objdump -d test.bin

test.bin:       file format elf32-littlearm


Disassembly of section .text:

00000000 <start>:
       0: 00 f1 01 00                   add.w   r0, r0, #1
       4: 01 f1 02 01                   add.w   r1, r1, #2
       8: ff f7 fe ff                   bl      #-4
       c: ff f7 fe bf                   b.w     #-4 <start+0x10>
$ 

So bl's argument is actually an offset. It's negative because we're going backwards. BUT, as the documentation says:

For B, BL, CBNZ, and CBZ instructions, the value of the PC is the address of the current instruction plus 4 bytes.

So bl #-4 will jump to (the address of bl) + 4 bytes - 4 bytes, or, in other words, itself, again!
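
One way to double-check this arithmetic is to decode the offset by hand. The sketch below assumes the 32-bit Thumb BL encoding (T1) as described in the ARM Architecture Reference Manual, where the offset is built from the S, J1, J2, imm10 and imm11 fields. The function name decode_thumb_bl is made up for illustration; it is not part of Keystone, Capstone or Unicorn.

```python
import struct


def decode_thumb_bl(machine_code: bytes, address: int):
    """Decode a 32-bit Thumb BL (encoding T1) and return (offset, target).

    Illustrative helper only, not part of any library.
    """
    # A 32-bit Thumb instruction is stored as two little-endian halfwords.
    hw1, hw2 = struct.unpack('<HH', machine_code[:4])
    if (hw1 & 0xF800) != 0xF000 or (hw2 & 0xD000) != 0xD000:
        raise ValueError('not a Thumb BL instruction')

    s = (hw1 >> 10) & 1
    imm10 = hw1 & 0x3FF
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF

    # I1 = NOT(J1 XOR S), I2 = NOT(J2 XOR S)
    i1 = 1 - (j1 ^ s)
    i2 = 1 - (j2 ^ s)

    # offset = SignExtend(S:I1:I2:imm10:imm11:'0'), a 25-bit signed value
    offset = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:
        offset -= 1 << 25

    # In Thumb state the PC reads as the instruction address plus 4.
    return offset, address + 4 + offset


# The BL bytes from the Keystone output above, located at address 8.
print(decode_thumb_bl(bytes.fromhex('fff7feff'), 8))  # (-4, 8)
```

For the bytes Keystone produced, this gives an offset of -4 and a target of 8, which matches the repeated "Address: 8" lines in the emulator trace.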

So I can't bl backwards for some reason? What's happening here, and how do I fix it?


Solution

  • Every toolchain has to deal with function calls and other references to external symbols, so in an unlinked object you will often see an instruction like bl encoded as a branch to self, a branch to zero, or some other incomplete placeholder (certainly for external labels). The tangent here is that some versions of clang resolve a local label at assembly time and some do not. Either way, the offset is patched up at link time, as it is in this case.

    A generic clang 3.7 build (all targets, default x86 host) emits the right instruction at the object level; 3.8 does not, so that appears to be when the behaviour changed. A generic clang 10 build does not either, but a hand-built clang 10.0.0 configured for a single target does give the right answer at assembly time.

    All of this is a tangent, because it concerns the assembler output, not the final binary. Once linked, you get the right answer (so far; the OP may have other cases where it does not).

    .thumb
    .syntax unified
    .thumb_func
    start:
        add r0, r0, #1
        add r1, r1, #2
        bl start
        b start
    
    clang-3.8 -c so.s -target armv7-unknown-linux -o so.o
    clang: warning: unknown platform, assuming -mfloat-abi=soft
    arm-none-eabi-objdump -D so.o
    
    so.o:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    00000000 <start>:
       0:   f100 0001   add.w   r0, r0, #1
       4:   f101 0102   add.w   r1, r1, #2
       8:   f7ff fffe   bl  0 <start>
       c:   e7f8        b.n 0 <start>
    

    The bl here is a branch to self: a placeholder that has not been resolved yet.

    But take that object and link it:

    arm-none-eabi-ld -Ttext=0 so.o -o so.elf
    arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000000000
    arm-none-eabi-objdump -d so.elf
    
    so.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    00000000 <start>:
       0:   f100 0001   add.w   r0, r0, #1
       4:   f101 0102   add.w   r1, r1, #2
       8:   f7ff fffa   bl  0 <start>
       c:   e7f8        b.n 0 <start>
    

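    And you get the correct answer: the second halfword is now fffa instead of fffe, so the offset is -12 rather than -4 and the branch lands on start.

    Sorry for the misleading earlier version of this answer; I was off on a tangent for a bit.

    If linking does not fix it for you in all cases, please comment.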

    Another part of the problem here is the tools not helping you:

    0008|   fff7feff    bl      #8         ; why not `bl #0`?
    
    8: ff f7 fe ff                   bl      #-4
    

    Both lines show the same instruction. It used to be a pair of Thumb instructions, 0xF7FF and 0xFFFE, but on ARMv7-A/R it is treated as a single, inseparable instruction, 0xF7FFFFFE.

    While looking this up again for this question, I came across the following note, which I either knew and forgot or never knew:

    Before ARMv6T2, J1 and J2 in encodings T1 and T2 were both 1, resulting in a smaller branch range. The instructions could be executed as two separate 16-bit instructions

    I have previously demonstrated, on pre-ARMv7 architectures, that the two halves execute as separate instructions rather than as one.

    Anyway:

    It is the same instruction as this one from the GNU tools:

       8:   f7ff fffe   bl  0 <start>
    

    The GNU output is a little better but still has issues: the encoding is not actually bl 0 <start>. The output shows the intended target, and the instruction is re-encoded correctly when linked.

    So the tools were likely part of the problem: they do not present the machine code in a form that reflects what it actually encodes, which makes it harder to see what is going on.
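
When the bytes come straight from an assembler that left the placeholder in and there is no link step to fix it, the same patch can be applied by hand. This is a minimal sketch, assuming the 32-bit Thumb BL encoding (T1) and known addresses for both the instruction and its target. encode_thumb_bl is a hypothetical helper written for this example; it only mimics the effect of the fix-up shown above and is not how ld implements relocations.

```python
import struct


def encode_thumb_bl(address: int, target: int) -> bytes:
    """Encode a 32-bit Thumb BL (encoding T1) located at `address`.

    Illustrative helper only, not part of any library.
    """
    # In Thumb state the PC reads as the instruction address plus 4.
    offset = target - (address + 4)
    if offset % 2 != 0 or not (-(1 << 24) <= offset < (1 << 24)):
        raise ValueError('target misaligned or out of range for BL')

    imm = offset & 0x1FFFFFF          # 25-bit two's complement
    s = (imm >> 24) & 1
    i1 = (imm >> 23) & 1
    i2 = (imm >> 22) & 1
    imm10 = (imm >> 12) & 0x3FF
    imm11 = (imm >> 1) & 0x7FF

    # Inverse of I1 = NOT(J1 XOR S), I2 = NOT(J2 XOR S)
    j1 = (1 - i1) ^ s
    j2 = (1 - i2) ^ s

    hw1 = 0xF000 | (s << 10) | imm10
    hw2 = 0xD000 | (j1 << 13) | (j2 << 11) | imm11
    # Two little-endian halfwords, first halfword at the lower address.
    return struct.pack('<HH', hw1, hw2)


# Bytes from the question; the BL sits at offset 8 and should target 0.
code = bytearray.fromhex('00f1010001f10201fff7fefff8e7')
code[8:12] = encode_thumb_bl(address=8, target=0)
print(code.hex())  # 00f1010001f10201fff7fafff8e7
```

The patched halfwords, f7ff fffa, are the same ones the linked so.elf shows at address 8.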