How can I get QEMU to run an Arm Thumb binary?

I'm trying to learn the basics of ARM assembly and wrote a fairly simple program to sort an array. I initially assembled it using the armv8-a option and ran the program under qemu while debugging with gdb. This worked fine and the program initialized the array and sorted it as expected.

Ultimately I would like to write some assembly for my Raspberry Pi Pico, which has an Arm Cortex-M0+ and, I believe, uses the armv6-m architecture option. However, when I change the directives in my code, it assembles fine but behaves strangely: the program counter increments by 4 after every instruction instead of the 2 that I expect for Thumb, and the program does not work correctly. I suspect that qemu is trying to run my code as if it were assembled for the full ARM instruction set instead of Thumb, but I'm not sure why.

I am running on Ubuntu Linux 20.04 LTS, using qemu-arm version 4.2.1 (installed from the package manager). Does the qemu-arm executable only run full ARM binaries? If so, is there another qemu package I can install to run a thumb binary?

Here is my code if it is helpful:

.arch armv6-m
.cpu cortex-m0plus

.syntax unified
.thumb

.data
arr: .skip 4 * 10
len: .word 10

.section .text
.global _start

.align 2
_start:
    ldr r0, arr_adr @ load the address of the start of the array into register 0
    movs r1, #0 @ clear the counter register
    movs r2, #100 @ first value to store; decreases by 10 on each pass of init_loop

init_loop:
    str r2, [r0,r1] @ store r2's value to the base address of the array plus the offset stored in r1
    subs r2, r2, #10 @ subtract 10 from r2
    adds r1, #4 @ add 4 to the offset (1 word in bytes)
    cmp r1, #40 @ check if we've reached the end of the array
    bne init_loop

    movs r1, #0 @ clear the offset
out_loop:
    mov r3, r1 @ set the index of the minimum value to the current array index

    mov r4, r1 @ set the inner loop index to the outer loop index

in_loop:
    ldr r5, [r0,r3] @ load the minimum index's value to r5
    ldr r6, [r0,r4] @ load the inner loop's next value to r6
    cmp r6, r5 @ compare the two values
    bge in_loop_inc @ if r6 is greater than or equal to r5, increment and restart loop
    mov r3, r4 @ set the minimum index to the current index
in_loop_inc:
    adds r4, #4
    cmp r4, #40 @ check if at end of array
    blt in_loop

    ldr r5, [r0,r3] @ load the minimum index value into r5
    ldr r6, [r0,r1] @ load the current outer loop index value into r6
    str r6, [r0,r3] @ swap the two values
    str r5, [r0,r1]

    adds r1, #4 @ increment outer loop index
    cmp r1, #40 @ check if at end of array
    blt out_loop

loop:
    nop
    b loop

arr_adr: .word arr

Thank you for your help!

Solution

There are a couple of concepts to disentangle here:

(1) Arm vs Thumb : these are two different instruction sets. Most CPUs support both, some support only one. Both are available simultaneously if the CPU supports both. To simplify a little bit, if you jump to an address with the least significant bit set that means "go to Thumb mode", and jumping to an address with that bit clear means "go to Arm mode". (Interworking is a touch more complicated than that, but that's a good initial mental model.) Note that all Arm instructions are 4 bytes long, but Thumb instructions can be either 2 or 4 bytes long.
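One consequence of the "least significant bit" rule in (1) may matter here: a user-mode program is started at its ELF entry address, and the GNU toolchain only sets bit 0 of a symbol's address when that symbol is marked as a Thumb function. The sketch below assumes GNU as and ld and an entry point named _start, as in the question. Whether a missing marker is the actual cause in this case is an assumption, not something checked against the asker's build.

```
.syntax unified
.cpu cortex-m0plus
.thumb

.section .text
.global _start

.align 2
.thumb_func             @ mark the next symbol (_start) as a Thumb function
_start:
    movs r0, #0         @ a 2-byte Thumb instruction
loop:
    nop
    b loop
```

With the marker in place, "readelf -h" on the linked binary should report an odd entry point address, which is the "go to Thumb mode" signal described above.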

(2) A-profile vs M-profile : these are two different families of CPU architecture. M-profile is "microcontrollers"; A-profile is "applications processors", which is "(almost) everything else". M-profile CPUs always support Thumb and only Thumb code. A-profile CPUs support both Arm and Thumb. The Raspberry Pi Pico is a Cortex-M0+, which is M-profile.

(3) QEMU system emulation vs user-mode emulation : these are two different QEMU executables which run guest code in different ways. The system emulation binary (typically qemu-system-arm) runs "bare metal code", e.g. an entire OS. The guest code has full control and can handle exceptions, write to hardware devices, etc. The user emulation binary (typically qemu-arm) is for running Linux user-space binaries. Guest code is started in unprivileged mode and has access to the usual Linux system calls. For system emulation, which CPU is being emulated depends on what machine type you select with the -M or --machine option. For user-mode emulation, the default CPU is "A-profile with all supported features enabled" (this is --cpu max).
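Because user-mode emulation gives the guest the usual Linux system calls, a program run under qemu-arm can finish with an exit call instead of spinning in an infinite loop. Here is a minimal sketch, assuming the 32-bit Arm EABI Linux convention (system call number in r7, first argument in r0, "svc #0" to trap). The label "done" is hypothetical, and how this behaves with an M-profile --cpu selection is not verified here.

```
.syntax unified
.thumb

.section .text
.global _start

.thumb_func         @ _start is Thumb code
_start:
done:
    movs r0, #0     @ exit status 0
    movs r7, #1     @ system call number 1 is exit on 32-bit Arm Linux
    svc #0          @ trap to the (emulated) kernel
```

This only makes sense under user-mode emulation. Bare-metal code on the Pico has no kernel to call, so the final loop in the question is the right shape there.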

You're currently using qemu-arm, which means you get user-mode emulation. This should support Thumb binaries, but unless you pass it a --cpu option it will be using an A-profile CPU. I would also suggest using a newer QEMU for M-profile work, because a lot of M-profile CPU bugs have been fixed since version 4.2. I'm not sure 4.2 has the Cortex-M0 CPU at all; running qemu-arm with "--cpu help" lists the CPU models your build provides.

GDB should tell you in the PSR what the T bit is set to (on an A-profile CPU that is bit 5 of the CPSR, which "info registers" shows) -- use that to check whether you're in Thumb mode or Arm mode, rather than looking at how much the PC is incrementing by.

There's currently no QEMU system emulation of the Raspberry Pi Pico (though somebody has been doing some experimental work on one). If your assembly is just basic "working with registers and a bit of memory" you can do that with the user-mode emulator. Or you can try the 'microbit' machine model (-M microbit with qemu-system-arm), which is a Cortex-M0 board -- if you're not doing things that are specific to the Pi Pico that might be good enough.