I'm trying to assemble this code using Keystone and execute it with the Unicorn engine:
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
In my opinion, the bl instruction should save the address of the next instruction to the lr register and then jump to start. So it'll be an infinite loop that adds 1 to r0 and 2 to r1.
Apparently, I'm wrong, because bl start branches to itself instead!
I'm using Python wrappers for Keystone, Capstone and Unicorn to process the assembly. Here's my code:
import keystone as ks
import capstone as cs
import unicorn as uc
print(f'Keystone {ks.__version__}\nCapstone {cs.__version__}\nUnicorn {uc.__version__}\n')
code = '''
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
'''
assembler = ks.Ks(ks.KS_ARCH_ARM, ks.KS_MODE_THUMB)
disassembler = cs.Cs(cs.CS_ARCH_ARM, cs.CS_MODE_THUMB)
emulator = uc.Uc(uc.UC_ARCH_ARM, uc.UC_MODE_THUMB)
machine_code, _ = assembler.asm(code)
machine_code = bytes(machine_code)
print(machine_code.hex())
initial_address = 0
for addr, size, mnem, op_str in disassembler.disasm_lite(machine_code, initial_address):
    # slice out the raw bytes of this instruction so they can be shown next to the disassembly
    instruction = machine_code[addr:addr + size]
    print(f'{addr:04x}|\t{instruction.hex():<8}\t{mnem:<5}\t{op_str}')
emulator.mem_map(initial_address, 1024) # allocate 1024 bytes of memory
emulator.mem_write(initial_address, machine_code) # write the machine code
emulator.hook_add(uc.UC_HOOK_CODE, lambda uc, addr, size, _: print(f'Address: {addr}'))
emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)
This is what it outputs:
Keystone 0.9.1
Capstone 5.0.0
Unicorn 1.0.2
00f1010001f10201fff7fefff8e7
0000| 00f10100 add.w r0, r0, #1
0004| 01f10201 add.w r1, r1, #2
0008| fff7feff bl #8 ; why not `bl #0`?
000c| f8e7 b #0
Address: 0
Address: 4
Address: 8 # OK, we arrived at BL start
Address: 8 # we're at the same instruction again?
Address: 8 # and again?
Address: 8
< ... >
Address: 8
Address: 8
Traceback (most recent call last):
File "run_ARM_bug.py", line 32, in <module>
emulator.emu_start(initial_address | 1, initial_address + len(machine_code), timeout=500)
File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/unicorn-1.0.2rc3-py3.7.egg/unicorn/unicorn.py", line 317, in emu_start
unicorn.unicorn.UcError: Emulation timed out (UC_ERR_TIMEOUT)
The exception is not a problem (I set the timeout myself). The problem is that bl start always jumps to itself instead of start.
If I jump forward, however, everything works as expected, so this works and bl jumps to the correct address:
start:
; stuff
bl next
; hello
next:
add r0, r0, #1
bkpt
EDIT
I went on and assembled this code with Clang:
; test.s
.text
.syntax unified
.globl start
.p2align 1
.code 16
.thumb_func
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
Used the following commands:
$ clang -c test.s -target armv7-unknown-linux -o test.bin -mthumb
clang-11: warning: unknown platform, assuming -mfloat-abi=soft
And then disassembled test.bin with objdump:
$ objdump -d test.bin
test.bin: file format elf32-littlearm
Disassembly of section .text:
00000000 <start>:
0: 00 f1 01 00 add.w r0, r0, #1
4: 01 f1 02 01 add.w r1, r1, #2
8: ff f7 fe ff bl #-4
c: ff f7 fe bf b.w #-4 <start+0x10>
$
So bl's argument is actually an offset. It's negative because we're going backwards. BUT, as the documentation says:
For B, BL, CBNZ, and CBZ instructions, the value of the PC is the address of the current instruction plus 4 bytes.
So bl #-4 will jump to (the address of bl) + 4 bytes - 4 bytes, or, in other words, itself, again!
So, I can't bl backwards for some reason? What's happening here and how do I fix it?
Every toolchain has to cope with calls to functions or other resources whose final address is not known at assembly time, so in an object file you will see instructions like bl encoded as a branch to self, a branch to zero, or some other incomplete placeholder (certainly for external labels). The tangent here is that some versions of clang resolve a local label at the assembler level and some do not. Either way, the offset/address is patched up at link time, as it is in this case.
A generic clang 3.7 (all targets, default x86 host) gives the right instruction at the object level; 3.8 doesn't, so that appears to be when the change happened. A generic clang 10 doesn't either, but a hand-built clang 10.0.0 configured for one specific target does give the right answer at assembly time.
All of this is a tangent because it concerns the assembler's output, not the final output. When linked you get the right answer (thus far; the OP may have other cases where it didn't).
.thumb
.syntax unified
.thumb_func
start:
add r0, r0, #1
add r1, r1, #2
bl start
b start
clang-3.8 -c so.s -target armv7-unknown-linux -o so.o
clang: warning: unknown platform, assuming -mfloat-abi=soft
arm-none-eabi-objdump -D so.o
so.o: file format elf32-littlearm
Disassembly of section .text:
00000000 <start>:
0: f100 0001 add.w r0, r0, #1
4: f101 0102 add.w r1, r1, #2
8: f7ff fffe bl 0 <start>
c: e7f8 b.n 0 <start>
The bl here is a branch to self, an incomplete placeholder.
But take that object and link it:
arm-none-eabi-ld -Ttext=0 so.o -o so.elf
arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000000000000
arm-none-eabi-objdump -d so.elf
so.elf: file format elf32-littlearm
Disassembly of section .text:
00000000 <start>:
0: f100 0001 add.w r0, r0, #1
4: f101 0102 add.w r1, r1, #2
8: f7ff fffa bl 0 <start>
c: e7f8 b.n 0 <start>
And you get the correct answer.
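To see why the two listings differ only in the last halfword, here is a minimal Python sketch that decodes the 32-bit Thumb-2 BL (encoding T1) by hand. The function name decode_thumb2_bl is made up for this illustration; the bit layout follows the ARM architecture manual's description of BL, and the halfword values and the address 8 are taken from the objdump listings above.

```python
def decode_thumb2_bl(hw1, hw2, address):
    """Decode a Thumb-2 BL (encoding T1) located at `address`.

    hw1, hw2 are the two 16-bit halfwords in the order objdump prints them.
    Returns (offset, target). Hypothetical helper, for illustration only.
    """
    s = (hw1 >> 10) & 1          # sign bit
    imm10 = hw1 & 0x3FF          # upper part of the offset
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF          # lower part of the offset
    i1 = 1 - (j1 ^ s)            # I1 = NOT(J1 XOR S)
    i2 = 1 - (j2 ^ s)            # I2 = NOT(J2 XOR S)
    # 25-bit value S:I1:I2:imm10:imm11:0, then sign-extend
    imm25 = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    offset = imm25 - (1 << 25) if s else imm25
    # In Thumb state the PC reads as the instruction address plus 4
    return offset, address + 4 + offset


# Unlinked object: f7ff fffe at address 8
print(decode_thumb2_bl(0xF7FF, 0xFFFE, 8))   # (-4, 8): branch to self
# Linked ELF: f7ff fffa at address 8
print(decode_thumb2_bl(0xF7FF, 0xFFFA, 8))   # (-12, 0): branch to start
```

The unlinked encoding carries an offset of -4, which lands back on the bl itself, while the linked one carries -12, which lands on start at address 0.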
Sorry for the misleading answer before; I was off on a tangent there for a bit.
If linking doesn't fix it for you in all cases, please comment.
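If there is no link step in your flow (for example, raw assembler output written straight into an emulator's memory), the same patch-up can be done by hand. The following is only a sketch under that assumption: encode_thumb2_bl is a hypothetical helper, not part of Keystone, Unicorn, or any toolchain, and the byte string and the offsets 8 and 0 are copied from the question's output.

```python
import struct


def encode_thumb2_bl(address, target):
    """Encode a Thumb-2 BL (encoding T1) at `address` that branches to `target`.

    Hypothetical helper for illustration; returns 4 little-endian bytes.
    """
    offset = target - (address + 4)      # PC reads as instruction address + 4
    if offset % 2 or not -(1 << 24) <= offset < (1 << 24):
        raise ValueError('target not reachable with a BL T1 encoding')
    imm = offset & 0x1FFFFFF             # 25-bit two's complement
    s = (imm >> 24) & 1
    i1 = (imm >> 23) & 1
    i2 = (imm >> 22) & 1
    imm10 = (imm >> 12) & 0x3FF
    imm11 = (imm >> 1) & 0x7FF
    j1 = (1 - i1) ^ s                    # inverse of I1 = NOT(J1 XOR S)
    j2 = (1 - i2) ^ s
    hw1 = 0xF000 | (s << 10) | imm10
    hw2 = 0xD000 | (j1 << 13) | (j2 << 11) | imm11
    return struct.pack('<HH', hw1, hw2)


# Machine code as printed in the question; the bl sits at offset 8
machine_code = bytes.fromhex('00f1010001f10201fff7fefff8e7')
patched = machine_code[:8] + encode_thumb2_bl(8, 0) + machine_code[12:]
print(patched.hex())   # 00f1010001f10201fff7fafff8e7
```

The patched bytes fff7faff at offset 8 are the same f7ff fffa halfwords that the linker produced above.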
Another part of the problem here is the tools not helping you:
0008| fff7feff bl #8 ; why not `bl #0`?
8: ff f7 fe ff bl #-4
This is the same instruction. It was formerly a pair of thumb instructions, 0xF7FF and 0xFFFE, but for armv7-ar it is considered one inseparable instruction, 0xF7FFFFFE.
Looking this up again for this question is how I found that out; I either knew it and forgot, or never knew.
Before ARMv6T2, J1 and J2 in encodings T1 and T2 were both 1, resulting in a smaller branch range. The instructions could be executed as two separate 16-bit instructions.
I have demonstrated that on architectures prior to armv7 the two halves can be separated from each other, showing that they are not one instruction.
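As a rough illustration of the older two-instruction view, here is a Python sketch that evaluates the pair the way the original Thumb BL prefix/suffix halves are described: the first half sets LR to PC plus the sign-extended high offset shifted left by 12, and the second half branches to LR plus the low offset shifted left by 1. The function name legacy_thumb_bl_target is invented for this example, and the halfwords are the ones from the listings in this answer.

```python
def legacy_thumb_bl_target(hw1, hw2, address):
    """Compute the BL target treating hw1/hw2 as two separate 16-bit
    Thumb instructions (prefix 11110, suffix 11111).

    Hypothetical helper, for illustration only.
    """
    assert hw1 >> 11 == 0b11110, 'not a BL prefix halfword'
    assert hw2 >> 11 == 0b11111, 'not a BL suffix halfword'
    high = hw1 & 0x7FF
    if high & 0x400:                 # sign-extend the 11-bit high offset
        high -= 0x800
    # First instruction: LR = PC + (high << 12), with PC = address + 4
    lr = (address + 4) + (high << 12)
    # Second instruction: branch to LR + (low << 1)
    low = hw2 & 0x7FF
    return lr + (low << 1)


print(legacy_thumb_bl_target(0xF7FF, 0xFFFE, 8))   # 8: branch to self
print(legacy_thumb_bl_target(0xF7FF, 0xFFFA, 8))   # 0: branch to start
```

With J1 and J2 both 1, as they are here, this two-step evaluation gives the same targets as treating the pair as a single 32-bit instruction.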
Anyway, it is the same instruction as this one from the gnu tools:
8: f7ff fffe bl 0 <start>
The gnu output is a little better but still has issues: the encoding is not really bl 0 <start>, but that output indicates the ultimate desire, and in the end the instruction is re-encoded to be correct when linked.
So the tools were likely part of the problem too: they made it harder to understand what is going on by not presenting the machine code in a faithfully decodable form.