Assembling THUMB instructions to execute on Cortex-M3


As an exercise, I want to make an STM32F103 execute code from internal SRAM. The idea is to write some THUMB assembly by hand, assemble it with arm-none-eabi-as, load the machine code into SRAM with OpenOCD's mwh command, set PC to the beginning of SRAM with reg pc 0x20000000, and finally step a few times.

Here is the assembly code I want to execute. It's basically a pointless loop.

# main.S
.thumb
.syntax unified

mov r0, #40
mov r1, #2
add r2, r0, r1
mvn r0, #0x20000000
bx r0

I need to get the machine code so that I can load it into SRAM, but the disassembler output doesn't seem to be right.

$ arm-none-eabi-as -mthumb -mcpu=cortex-m3 -o main.o main.S
$ arm-none-eabi-objdump -d -m armv7 main.o

main.o:     file format elf32-littlearm


Disassembly of section .text:

00000000 <.text>:
   0:   f04f 0028   mov.w   r0, #40 ; 0x28
   4:   f04f 0102   mov.w   r1, #2
   8:   eb00 0201   add.w   r2, r0, r1
   c:   f06f 5000   mvn.w   r0, #536870912  ; 0x20000000
  10:   4700        bx  r0

Shouldn't THUMB instructions be 16 bits long? The machine code I got takes 4 bytes per instruction.


Solution

  • The STM32F103 is cortex-m3 based. Start with the ST documentation, which tells you that, then go to ARM's website and get the cortex-m3 technical reference manual. That tells you the core is based on the armv7-m architecture, so you also get the armv7-m architectural reference manual. Then you can BEGIN to start programming.

    Running from flash the normal way uses a vector table. Running from ram can mean that too, depending on the boot pins, but if you want to download the program using the debugger you are on the right path; you just got stuck or stopped before finishing.

    # main.S
    .thumb
    .syntax unified
    
    mov r0, #40
    mov r1, #2
    add r2, r0, r1
    mvn r0, #0x20000000
    bx r0
    

    You specified unified syntax and, on the command line, cortex-m3, so you ended up with thumb2 extensions. Those are two 16-bit halves, as documented by ARM (the armv7-m architectural reference manual shows you all the instructions). The instruction set is variable length: the first halfword is decoded to tell whether a second halfword follows, and the second one carries the rest of the encoding, mostly operands. The non-thumb2 instructions are all 16 bits. The bl/blx instructions were/are two separate 16-bit instructions, but the cortex-ms want those to be back to back, whereas on prior cores you could actually separate them to demonstrate they were truly two different instructions.

    so for example

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    add r2, r0, r1
    adds r2, r0, r1
    
    00000000 <.text>:
       0:   eb00 0201   add.w   r2, r0, r1
       4:   1842        adds    r2, r0, r1
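
    To make the "first halfword is decoded" point concrete, here is a small Python sketch (Python is my choice for illustration, it is not part of the toolchain, and the function name is made up). It assumes the armv7-m rule that a halfword whose top five bits are 0b11101, 0b11110 or 0b11111 starts a 32-bit instruction, and that everything else is a complete 16-bit instruction.

```python
def thumb_insn_len(first_halfword: int) -> int:
    """Return the instruction size in bytes, judged from its first halfword.

    Assumption (armv7-m): top five bits 0b11101, 0b11110 or 0b11111 mean
    "first half of a 32-bit instruction"; anything else is 16-bit.
    """
    top5 = (first_halfword >> 11) & 0x1F
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


# First halfwords copied from the disassembly listings in this thread.
for hw in (0xF04F, 0xEB00, 0x1842, 0x2028, 0x4700):
    print(f"{hw:#06x}: {thumb_insn_len(hw)} bytes")
```

    So add.w (0xeb00 ...) and mov.w (0xf04f ...) take two halfwords, while adds (0x1842) is complete in one.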
    

    The 16-bit "all thumb variant" encoding exists only with flags, so you have to write adds. That is the case if you use gnu assembler and specified unified syntax, which most folks are going to tell you to do; I do not personally. Just so you know, without unified syntax:

    .cpu cortex-m3
    .thumb
    
    add r2, r0, r1
    adds r2, r0, r1
    
    so.s: Assembler messages:
    so.s:6: Error: instruction not supported in Thumb16 mode -- `adds r2,r0,r1'
    

    so

    .cpu cortex-m3
    .thumb
    
    add r2, r0, r1
    add r2, r0, r1
    
    00000000 <.text>:
       0:   1842        adds    r2, r0, r1
       2:   1842        adds    r2, r0, r1
    

    Just to warn you in case you fall into that trap. And don't you just love that the disassembler uses adds.

    Anyway. So these are fine; these are the thumb2 encodings of your original code

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    mov r0, #40
    mov r1, #2
    add r2, r0, r1
    mvn r0, #0x20000000
    bx r0
    
    
    00000000 <.text>:
       0:   f04f 0028   mov.w   r0, #40 ; 0x28
       4:   f04f 0102   mov.w   r1, #2
       8:   eb00 0201   add.w   r2, r0, r1
       c:   f06f 5000   mvn.w   r0, #536870912  ; 0x20000000
      10:   4700        bx  r0
    

    Like add, the 16-bit encoding of mov is with flags, so

    movs r0, #40
    movs r1, #2
    
    00000000 <.text>:
       0:   2028        movs    r0, #40 ; 0x28
       2:   2102        movs    r1, #2
       4:   eb00 0201   add.w   r2, r0, r1
       8:   f06f 5000   mvn.w   r0, #536870912  ; 0x20000000
       c:   4700        bx  r0
    

    and we know about add now

    00000000 <.text>:
       0:   2028        movs    r0, #40 ; 0x28
       2:   2102        movs    r1, #2
       4:   1842        adds    r2, r0, r1
       6:   f06f 5000   mvn.w   r0, #536870912  ; 0x20000000
       a:   4700        bx  r0
    

    The mvn makes no sense. You want to branch to 0x20000000, so there are two things to fix. First, you want 0x20000000, not 0xDFFFFFFF, so try this

       0:   2028        movs    r0, #40 ; 0x28
       2:   2102        movs    r1, #2
       4:   1842        adds    r2, r0, r1
       6:   f04f 5000   mov.w   r0, #536870912  ; 0x20000000
       a:   4700        bx  r0
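
    If it is not obvious where 0xDFFFFFFF comes from: mvn writes the bitwise NOT of its operand. A one-function Python sketch of that arithmetic (illustration only, the helper name is made up):

```python
MASK32 = 0xFFFFFFFF


def mvn(operand: int) -> int:
    """Value MVN leaves in the destination: the bitwise NOT, kept to 32 bits."""
    return ~operand & MASK32


print(hex(mvn(0x20000000)))
```

    The result is an address nowhere near sram, which is why the original bx could not have worked.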
    

    Second, this is a cortex-m, so you can't bx to an even address. That is how you would switch to arm mode, but this processor does not support arm mode, so you will fault. You need the lsbit set. So try this

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    movs r0, #40
    movs r1, #2
    adds r2, r0, r1
    ldr r0, =0x20000001
    bx r0
    
    00000000 <.text>:
       0:   2028        movs    r0, #40 ; 0x28
       2:   2102        movs    r1, #2
       4:   1842        adds    r2, r0, r1
       6:   4801        ldr r0, [pc, #4]    ; (c <.text+0xc>)
       8:   4700        bx  r0
       a:   0000        .short  0x0000
       c:   20000001    .word   0x20000001
    

    With gnu assembler the ldr = pseudo-instruction will pick the most efficient (smallest instruction) solution if it can; otherwise it pulls the value from the literal pool.

    Or you could do this and not use the pool

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    movs r0, #40
    movs r1, #2
    adds r2, r0, r1
    mov r0, #0x20000000
    orr r0,r0,#1
    bx r0
    

    This makes my skin crawl because you want to orr, not add, but it makes the code a halfword shorter if that matters:

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    movs r0, #40
    movs r1, #2
    adds r2, r0, r1
    mov r0, #0x20000000
    adds r0,#1
    bx r0
    
    00000000 <.text>:
       0:   2028        movs    r0, #40 ; 0x28
       2:   2102        movs    r1, #2
       4:   1842        adds    r2, r0, r1
       6:   f04f 5000   mov.w   r0, #536870912  ; 0x20000000
       a:   3001        adds    r0, #1
       c:   4700        bx  r0
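
    The reason orr is the right habit and add is not: orr is harmless if the lsbit is already set, add is not. A tiny Python sketch of the difference (illustration only, helper names are made up):

```python
def thumb_target_orr(addr: int) -> int:
    """Set bit 0 with OR: safe whether or not it was already set."""
    return addr | 1


def thumb_target_add(addr: int) -> int:
    """Set bit 0 with ADD: only correct if the address was even."""
    return addr + 1


for addr in (0x20000000, 0x20000001):
    print(hex(addr), hex(thumb_target_orr(addr)), hex(thumb_target_add(addr)))
```

    With an address that already has the lsbit set, the add version produces an even address again, and bx to that faults on a cortex-m. In the listing above the add is only safe because the constant 0x20000000 is known to be even.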
    

    Then you need to link. But...

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    movs r0,#0
    loop:
       adds r0,#1
       b loop
    

    Link without a linker script to make this quick

    arm-none-eabi-as so.s -o so.o
    arm-none-eabi-ld -Ttext=0x20000000 so.o -o so.elf
    arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000020000000
    arm-none-eabi-objdump -d so.elf
        
    so.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    20000000 <_stack+0x1ff80000>:
    20000000:   2000        movs    r0, #0
    
    20000002 <loop>:
    20000002:   3001        adds    r0, #1
    20000004:   e7fd        b.n 20000002 <loop>
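
    Notice that the branch is pc-relative, so the same six bytes run wherever you put them. A Python sketch of how e7fd comes about, assuming the 16-bit unconditional branch encoding (0xE000 with an 11-bit signed halfword offset) from the armv7-m manual; the function name is made up:

```python
def encode_b_t2(insn_addr: int, target: int) -> int:
    """Encode the 16-bit unconditional branch: 0xE000 | imm11."""
    pc = insn_addr + 4          # in thumb the pc reads as instruction address + 4
    offset = target - pc        # signed byte offset, must be even
    if offset % 2 or not -2048 <= offset <= 2046:
        raise ValueError("target not reachable with a 16-bit branch")
    return 0xE000 | ((offset >> 1) & 0x7FF)


# Same loop linked for sram and for flash: identical encoding.
print(hex(encode_b_t2(0x20000004, 0x20000002)))
print(hex(encode_b_t2(0x0800000C, 0x0800000A)))
```

    Both calls give the same halfword because only the distance between the branch and its target is encoded, not the absolute address.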
    

    Open two windows, in one start openocd to connect to the board/chip

    In the other

    telnet localhost 4444
    

    When you get the openocd prompt assuming that all worked

    halt
    load_image so.elf
    resume 0x20000000
    

    Or you can resume 0x20000001 since that feels better, but the tool is fine either way. Now

    halt
    reg r0
    resume
    halt
    reg r0
    resume
    

    Being an stm32 and being all thumb variant instructions, this example will work on any stm32 I have heard of so far.

    What you will see is that r0 increments. In the human time between resuming and halting again it counts many times, so you can see the number change and know that the program is running.

    telnet localhost 4444
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Open On-Chip Debugger
    > halt
    > load_image so.elf
    6 bytes written at address 0x20000000
    downloaded 6 bytes in 0.001405s (4.170 KiB/s)
    > resume 0x20000000
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x20000002 msp: 0x20001000
    > reg r0
    r0 (/32): 0x000ED40C
    > resume 
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x20000002 msp: 0x20001000
    > reg r0
    r0 (/32): 0x001C8777
    > 
    

    If you want to then put it in flash, that assumes the blue pill (this is a blue pill, right?) does not have write-protected flash, which some do. You can remove the protection (I will let you figure that out; it is not necessarily easy; pro tip: a complete power cycle is involved at some point).

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    .word 0x20001000
    .word reset
    
    .thumb_func
    reset:
    movs r0,#0
    loop:
       adds r0,#1
       b loop
    
    arm-none-eabi-as so.s -o so.o
    arm-none-eabi-ld -Ttext=0x08000000 so.o -o so.elf
    arm-none-eabi-ld: warning: cannot find entry symbol _start; defaulting to 0000000008000000
    arm-none-eabi-objdump -d so.elf
        
    
    so.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    08000000 <_stack+0x7f80000>:
     8000000:   20001000    .word   0x20001000
     8000004:   08000009    .word   0x08000009
    
    08000008 <reset>:
     8000008:   2000        movs    r0, #0
    
    0800000a <loop>:
     800000a:   3001        adds    r0, #1
     800000c:   e7fd        b.n 800000a <loop>
    

    The reset vector needs to be the address of the handler ORRED with one. And the vector table needs to be at 0x08000000 (or 0x00000000, but you will end up wanting 0x08000000, or 0x02000000 for some parts; not this one, it is 0x08000000 for this one; read the docs).
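
    To see what those first two words look like as bytes in flash, here is a Python sketch that packs them little-endian. It is only an illustration of the layout in the listing above; the function name is made up and the real table is of course produced by the assembler and linker.

```python
import struct


def vector_table(initial_sp: int, reset_handler: int) -> bytes:
    """First two vector table words: initial stack pointer, then the reset
    handler address with bit 0 set (thumb), both little-endian."""
    return struct.pack("<II", initial_sp, reset_handler | 1)


# Values from the listing: stack at 0x20001000, reset label at 0x08000008.
print(vector_table(0x20001000, 0x08000008).hex())
```

    The second word comes out as 0x08000009, which matches what .thumb_func made the linker emit.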

    In the telnet into openocd

    flash write_image erase so.elf
    reset
    halt
    reg r0
    resume
    halt
    reg r0
    resume
    

    And now it is programmed in flash so if you power off then on that is what it runs.

    openocd will end with something like this

    Info : stm32f1x.cpu: hardware has 6 breakpoints, 4 watchpoints
    

    then the telnet session

    telnet localhost 4444
    
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Open On-Chip Debugger
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0xa1000000 pc: 0x0800000a msp: 0x20001000
    > flash write_image erase so.elf
    auto erase enabled
    device id = 0x20036410
    flash size = 64kbytes
    wrote 1024 bytes from file so.elf in 0.115819s (8.634 KiB/s)
    > reset
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x0800000a msp: 0x20001000
    > reg r0
    r0 (/32): 0x002721D4
    > resume
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x0800000a msp: 0x20001000
    > reg r0
    r0 (/32): 0x0041DF80
    >       
    

    If you want the flash to reset into ram you can do that

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    .word 0x20001000
    .word 0x20000001
    

    On a power cycle it should ideally crash/fault, but if you use openocd to put something in ram like we did before, a reset runs it

    flash.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    08000000 <_stack+0x7f80000>:
     8000000:   20001000    .word   0x20001000
     8000004:   20000001    .word   0x20000001
    
    
    
    so.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    20000000 <_stack+0x1ff80000>:
    20000000:   2000        movs    r0, #0
    
    20000002 <loop>:
    20000002:   3001        adds    r0, #1
    20000004:   e7fd        b.n 20000002 <loop>
    
    telnet localhost 4444
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Open On-Chip Debugger
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x0800000a msp: 0x20001000
    > flash write_image erase flash.elf
    auto erase enabled
    device id = 0x20036410
    flash size = 64kbytes
    wrote 1024 bytes from file flash.elf in 0.114950s (8.699 KiB/s)
    > load_image so.elf
    6 bytes written at address 0x20000000
    downloaded 6 bytes in 0.001399s (4.188 KiB/s)
    > reset
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x20000002 msp: 0x20001000
    > reg r0
    r0 (/32): 0x001700E0
    > resume
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x20000004 msp: 0x20001000
    > reg r0
    r0 (/32): 0x00245FF1
    > resume
    > halt
    target state: halted
    target halted due to debug-request, current mode: Thread 
    xPSR: 0x01000000 pc: 0x20000002 msp: 0x20001000
    > reg r0
    r0 (/32): 0x00311776
    > 
    

    but a power cycle

    telnet localhost 4444
    Trying 127.0.0.1...
    Connected to localhost.
    Escape character is '^]'.
    Open On-Chip Debugger
    > halt
    > reset
    stm32f1x.cpu -- clearing lockup after double fault
    target state: halted
    target halted due to debug-request, current mode: Handler HardFault
    xPSR: 0x01000003 pc: 0xfffffffe msp: 0x20000fe0
    Polling target stm32f1x.cpu failed, trying to reexamine
    stm32f1x.cpu: hardware has 6 breakpoints, 4 watchpoints
    > halt
    > 
    

    Yeah, not happy, as expected/desired.

    Note that _start comes from an ENTRY(_start) in a default linker script; it is not special nor really hard-coded into the tools (nor is main for gcc; that comes from a default bootstrap).

    So you can do this

    so.s

    .cpu cortex-m3
    .thumb
    .syntax unified
    movs r0,#0
    loop:
       adds r0,#1
       b loop
    

    so.ld

    MEMORY
    {
        hello : ORIGIN = 0x20000000, LENGTH = 0x1000
    }
    SECTIONS
    {
        .text   : { *(.text*)   } > hello
    }
    
    
    arm-none-eabi-as so.s -o so.o
    arm-none-eabi-ld -T so.ld so.o -o so.elf
    arm-none-eabi-objdump -d so.elf
    
    so.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    20000000 <loop-0x2>:
    20000000:   2000        movs    r0, #0
    
    20000002 <loop>:
    20000002:   3001        adds    r0, #1
    20000004:   e7fd        b.n 20000002 <loop>
    

    and the _start warning goes away. Note that the memory region names you create in the linker script (hello in this case) do not have to be ram, rom, flash, etc.; they can be whatever you want. And yes, you could do this with a linker script that has no MEMORY section and only SECTIONS.

    If you choose to

    arm-none-eabi-objcopy -O binary so.elf so.bin
    

    openocd can read elf files and some other formats, but for a raw memory image like that you have to specify the address, otherwise you might get 0x00000000 or who knows what

    load_image so.bin 0x20000000
    

    If/when you get some nucleo boards, you can simply copy the bin file to the virtual thumb drive and it will load it into the target mcu for you. The virtual drive will then reload, and it will show a FAIL.TXT if the load did not work; one way that happens is if you link for 0x00000000 instead of 0x08000000. You can't load into sram that way though, just flash. But I assume you have a blue pill, not a nucleo board.

    That is the long answer.

    Short answer

    Those are thumb2 extensions; they are two halfwords in size. See the armv7-m architectural reference manual for the instruction descriptions. They are perfectly fine for this chip.

    You probably want to use load_image, not mwh, in openocd, but mwh will work if you get your halfwords in the right order.

    You ideally want to link, although as written your code or mine is position independent, so arguably you could just extract the instructions and use mwh.
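
    If you do go the mwh route, "the right order" means: the halfword at the lower address first, and for a thumb2 instruction the first halfword that objdump prints (for example f04f) goes at the lower address. A Python sketch that turns a raw little-endian image (what objcopy -O binary gives you) into mwh commands; the function name and the default base address are my assumptions for this example:

```python
import struct


def mwh_commands(code: bytes, base: int = 0x20000000) -> list[str]:
    """Emit one OpenOCD 'mwh <addr> <halfword>' line per halfword of code."""
    if len(code) % 2:
        raise ValueError("thumb code must be a whole number of halfwords")
    commands = []
    for index, (halfword,) in enumerate(struct.iter_unpack("<H", code)):
        commands.append(f"mwh {base + 2 * index:#010x} {halfword:#06x}")
    return commands


# The six bytes of the counting loop as they sit in memory.
for line in mwh_commands(bytes.fromhex("00200130fde7")):
    print(line)
```

    After typing those in, reg pc 0x20000000 and step should behave the same as the load_image version.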

    The chip has a boot-from-sram mode, which would/should use a vector table and not just launch into instructions. You would need to get the boot pins set right and use something like openocd to load the program into ram, then reset (not power cycle).

    MVN (move NOT, a bitwise complement) is not the right instruction here, and you need the lsbit set before using bx, so you want 0x20000001 in the register, something like

    ldr r0,=0x20000001
    bx r0
    

    for gnu assembler, or

    mov r0,#0x20000000
    orr r0,#1
    bx r0
    

    but that is for armv7-m; for cortex-m0, m0+ and some of the armv8-m cores you can't use those instructions, they will not assemble.

    .cpu cortex-m0
    .thumb
    .syntax unified
    mov r0,#0x20000000
    orr r0,#1
    bx r0
    
    arm-none-eabi-as so.s -o so.o
    so.s: Assembler messages:
    so.s:5: Error: cannot honor width suffix -- `mov r0,#0x20000000'
    so.s:6: Error: cannot honor width suffix -- `orr r0,#1'
    

    So use the ldr = pseudo instruction or load from the pool manually, or load 0x2 or 0x20 or something like that and shift it, then load another register with 1 and orr it in, or use add (yuck).

    Edit

    .cpu cortex-m3
    .thumb
    .syntax unified
    .globl _start
    _start:
    ldr r0,=0x12345678
    b .
    
    
    00000000 <_start>:
       0:   4800        ldr r0, [pc, #0]    ; (4 <_start+0x4>)
       2:   e7fe        b.n 2 <_start+0x2>
       4:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000
    

    If it cannot generate a single instruction then it will generate a pc-relative load and put the value in a literal pool, somewhere after a branch if it can find one.

    But you can do this yourself too

    .cpu cortex-m3
    .thumb
    .syntax unified
    .globl _start
    _start:
    ldr r0,myvalue
    b .
    .align
    myvalue: .word 0x12345678
    
    
    00000000 <_start>:
       0:   4800        ldr r0, [pc, #0]    ; (4 <myvalue>)
       2:   e7fe        b.n 2 <_start+0x2>
    
    00000004 <myvalue>:
       4:   12345678    eorsne  r5, r4, #120, 12    ; 0x7800000
    

    The literal pool is an area of memory (in the text segment), which is used to store constants.

    unsigned int fun0 ( void )
    {
        return 0x12345678;
    }
    unsigned int fun1 ( void )
    {
        return 0x11223344;
    }
    00000000 <fun0>:
       0:   e59f0000    ldr r0, [pc]    ; 8 <fun0+0x8>
       4:   e12fff1e    bx  lr
       8:   12345678    .word   0x12345678
    
    0000000c <fun1>:
       c:   e59f0000    ldr r0, [pc]    ; 14 <fun1+0x8>
      10:   e12fff1e    bx  lr
      14:   11223344    .word   0x11223344
    

    Not unusual to have the C compiler do this and put it at the end of the function.

        .global fun1
        .syntax unified
        .arm
        .fpu softvfp
        .type   fun1, %function
    fun1:
        @ Function supports interworking.
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        ldr r0, .L6
        bx  lr
    .L7:
        .align  2
    .L6:
        .word   287454020
        .size   fun1, .-fun1
    

    I did not build that for thumb/cortex-m, but that is fine; it would do the same thing. But, having said that:

    unsigned int fun0 ( void )
    {
        return 0x12345678;
    }
    unsigned int fun1 ( void )
    {
        return 0x00110011;
    }
    
    00000000 <fun0>:
       0:   4800        ldr r0, [pc, #0]    ; (4 <fun0+0x4>)
       2:   4770        bx  lr
       4:   12345678    .word   0x12345678
    
    00000008 <fun1>:
       8:   f04f 1011   mov.w   r0, #1114129    ; 0x110011
       c:   4770        bx  lr
    

    I picked 0x00110011 because I have a rough idea of what immediates the various arm instruction sets can encode. Likewise

    .cpu cortex-m3
    .thumb
    .syntax unified
    .globl _start
    _start:
    ldr r0,=0x12345678
    ldr r1,=0x00110011
    nop
    nop
    nop
    b .
    
    00000000 <_start>:
       0:   4803        ldr r0, [pc, #12]   ; (10 <_start+0x10>)
       2:   f04f 1111   mov.w   r1, #1114129    ; 0x110011
       6:   bf00        nop
       8:   bf00        nop
       a:   bf00        nop
       c:   e7fe        b.n c <_start+0xc>
       e:   0000        .short  0x0000
      10:   12345678    .word   0x12345678
    

    By using the ldr = syntax, gnu assembler will pick the optimal instruction. This is not supported by all arm assemblers (assembly language is defined by the tool, not the target), and not all will choose the optimal instruction; some may always generate the pc-relative ldr, if they recognize the syntax at all.

    It is somewhat meant to be used to get the address of a label, for example
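
    As a rough guide to which constants can become a single mov.w, here is a Python sketch of the thumb2 "modified immediate" rule as I read it in the armv7-m manual: an 8-bit value, that byte repeated as 0x00XY00XY, 0xXY00XY00 or 0xXYXYXYXY, or an 8-bit value with its top bit set rotated right by 8 to 31. The function name is made up, and it only covers this one form; the assembler may have other single-instruction options (movw or mvn for instance), so a False here does not by itself mean the pool is used.

```python
def is_thumb2_modified_immediate(value: int) -> bool:
    """True if value fits the thumb2 modified-immediate forms (sketch)."""
    value &= 0xFFFFFFFF
    low = value & 0xFF
    high = (value >> 8) & 0xFF
    # 0x000000XY, 0x00XY00XY, 0xXY00XY00, 0xXYXYXYXY
    if value in (low, low | (low << 16), (high << 8) | (high << 24),
                 low * 0x01010101):
        return True
    # 8-bit value with bit 7 set, rotated right by 8..31: undo the rotation
    for rot in range(8, 32):
        unrotated = ((value << rot) | (value >> (32 - rot))) & 0xFFFFFFFF
        if 0x80 <= unrotated <= 0xFF:
            return True
    return False


for constant in (40, 0x20000000, 0x00110011, 0x12345678, 0x20000001):
    print(f"{constant:#010x}", is_thumb2_modified_immediate(constant))
```

    That lines up with the listings: 0x00110011 and 0x20000000 became mov.w, while 0x12345678 and 0x20000001 went to the pool.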

    .cpu cortex-m3
    .thumb
    .syntax unified
    .globl _start
    _start:
    ldr r0,=mydataword
    ldr r1,[r0]
    add r1,#1
    str r1,[r0]
    bx lr
    
    .data
    mydataword: .word 0
    

    With the label being in another segment, the assembler can't resolve this at assembly time, so it leaves a placeholder for the linker

    00000000 <_start>:
       0:   4802        ldr r0, [pc, #8]    ; (c <_start+0xc>)
       2:   6801        ldr r1, [r0, #0]
       4:   f101 0101   add.w   r1, r1, #1
       8:   6001        str r1, [r0, #0]
       a:   4770        bx  lr
       c:   00000000    .word   0x00000000
    
    arm-none-eabi-ld -Ttext=0x1000 -Tdata=0x2000 so.o -o so.elf
    arm-none-eabi-objdump -D so.elf
    
    so.elf:     file format elf32-littlearm
    
    
    Disassembly of section .text:
    
    00001000 <_start>:
        1000:   4802        ldr r0, [pc, #8]    ; (100c <_start+0xc>)
        1002:   6801        ldr r1, [r0, #0]
        1004:   f101 0101   add.w   r1, r1, #1
        1008:   6001        str r1, [r0, #0]
        100a:   4770        bx  lr
        100c:   00002000    andeq   r2, r0, r0
    
    Disassembly of section .data:
    
    00002000 <__data_start>:
        2000:   00000000
    

    Or

    .cpu cortex-m3
    .thumb
    .syntax unified
    .globl _start
    _start:
    ldr r0,=somefun
    ldr r1,[r0]
    orr r1,#1
    bx r1
    .align
    somefun:
        nop
        b .
    

    even in the same segment

    00000000 <_start>:
       0:   4803        ldr r0, [pc, #12]   ; (10 <somefun+0x4>)
       2:   6801        ldr r1, [r0, #0]
       4:   f041 0101   orr.w   r1, r1, #1
       8:   4708        bx  r1
       a:   bf00        nop
    
    0000000c <somefun>:
       c:   bf00        nop
       e:   e7fe        b.n e <somefun+0x2>
      10:   0000000c    .word   0x0000000c
    
    
    00001000 <_start>:
        1000:   4803        ldr r0, [pc, #12]   ; (1010 <somefun+0x4>)
        1002:   6801        ldr r1, [r0, #0]
        1004:   f041 0101   orr.w   r1, r1, #1
        1008:   4708        bx  r1
        100a:   bf00        nop
    
    0000100c <somefun>:
        100c:   bf00        nop
        100e:   e7fe        b.n 100e <somefun+0x2>
        1010:   0000100c    andeq   r1, r0, r12
    

    If you let the tools do the work though

    .cpu cortex-m3
    .thumb
    .syntax unified
    .globl _start
    _start:
    ldr r0,=somefun
    ldr r1,[r0]
    bx r1
    .align
    .thumb_func
    somefun:
        nop
        b .
    

    You do not need to orr in the lsbit, the tool does it for you

    00001000 <_start>:
        1000:   4802        ldr r0, [pc, #8]    ; (100c <somefun+0x4>)
        1002:   6801        ldr r1, [r0, #0]
        1004:   4708        bx  r1
        1006:   bf00        nop
    
    00001008 <somefun>:
        1008:   bf00        nop
        100a:   e7fe        b.n 100a <somefun+0x2>
        100c:   00001009    andeq   r1, r0, r9
    

    These are all or mostly cases of the literal pool being used to help out with an instruction set like this that is somewhat fixed in length so has a limit on immediate values.

    Sometimes you can help gnu assembler as to where to put the pool data

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    .globl fun0
    .thumb_func
    fun0:
    ldr r0,=0x12345678
    bx lr
    .globl fun1
    .thumb_func
    fun1:
    ldr r0,=0x11223344
    bx lr
    .align
    .word 0x111111
    
    00000000 <fun0>:
       0:   4802        ldr r0, [pc, #8]    ; (c <fun1+0x8>)
       2:   4770        bx  lr
    
    00000004 <fun1>:
       4:   4802        ldr r0, [pc, #8]    ; (10 <fun1+0xc>)
       6:   4770        bx  lr
       8:   00111111    .word   0x00111111
       c:   12345678    .word   0x12345678
      10:   11223344    .word   0x11223344
    

    but if I add a .pool directive after fun0

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    .globl fun0
    .thumb_func
    fun0:
    ldr r0,=0x12345678
    bx lr
    .pool
    .globl fun1
    .thumb_func
    fun1:
    ldr r0,=0x11223344
    bx lr
    .align
    .word 0x111111
    
    00000000 <fun0>:
       0:   4800        ldr r0, [pc, #0]    ; (4 <fun0+0x4>)
       2:   4770        bx  lr
       4:   12345678    .word   0x12345678
    
    00000008 <fun1>:
       8:   4801        ldr r0, [pc, #4]    ; (10 <fun1+0x8>)
       a:   4770        bx  lr
       c:   00111111    .word   0x00111111
      10:   11223344    .word   0x11223344
    

    So

    ldr r0,=something
    

    Means: at link time, or sometime, load the address of something into r0. Labels are just addresses, which are just values/numbers, so

    ldr r0,=0x12345678
    

    Means the same thing, except that instead of a label you supply the value itself: give me the address of that label, which is 0x12345678, and put that in r0. It is an interesting extension of that notion that gas or someone thought of, probably arm's assembler, I do not remember; then others adopted it as well, or improved upon it, or whatever. Note that if you want to do it yourself you do this

    ldr r0,something_address
    b .
    .align
    something_address: .word something
    

    Because something is a label, which is an address, which is a value, you do not put the equals there; the equals is just for the ldr instruction. It is the same as the vector table:

    .word 0x20001000
    .word reset
    

    And lastly you can do one of these to get the function address correct for so called thumb interwork

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    .word 0x20001000
    .word reset
    .word handler
    .word broken
    
    .thumb_func
    reset:
        b .
    
    .type handler,%function
    handler:
        b .
        
    broken:
        b .
    
    Disassembly of section .text:
    
    08000000 <_stack+0x7f80000>:
     8000000:   20001000    .word   0x20001000
     8000004:   08000011    .word   0x08000011
     8000008:   08000013    .word   0x08000013
     800000c:   08000014    .word   0x08000014
    
    08000010 <reset>:
     8000010:   e7fe        b.n 8000010 <reset>
    
    08000012 <handler>:
     8000012:   e7fe        b.n 8000012 <handler>
    
    08000014 <broken>:
     8000014:   e7fe        b.n 8000014 <broken>
    

    You can use .thumb_func if in thumb mode, and you can use .type label,%function in both arm mode and thumb mode. You can see that either one generates the proper thumb address in the vector table, but where neither was used (the broken label) the address is not generated correctly, so that vector would fault on a cortex-m.

    Some folks sadly do this:

    .word reset + 1
    .word handler + 1
    .word broken + 1
    

    to try to fix that rather than using the tool as intended. Other assembly languages for arm/thumb, meaning other tools (ARM, Keil, etc.), have their own syntax and rules; this is limited to gnu assembler.

    Also note how much of this answer was just command line stuff, I examined the output of the tool and manipulated it until I got what I wanted, did not have to load and run code to see what was going on. Just use the tools.

    Edit 2

    Reading the rest of your question in the comment

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    ldr r0,=0x12345678
    nop
    b .
    
    
    00000000 <.text>:
       0:   4801        ldr r0, [pc, #4]    ; (8 <.text+0x8>)
       2:   bf00        nop
       4:   e7fe        b.n 4 <.text+0x4>
       6:   0000        .short  0x0000
       8:   12345678    .word   0x12345678
    

    Putting the .word at offset 6 would be an alignment fault for an ldr, so the assembler needs to pad so that the word lands on a word-aligned address.

    By now you should have downloaded the armv7-m architectural reference manual from ARM's website or elsewhere. And you can see at least in the one I am looking at (these are constantly evolving documents) the T1 encoding

    imm32 = ZeroExtend(imm8:'00', 32); add = TRUE;
    

    and further down

    Encoding T1 multiples of four in the range 0 to 1020
    

    and

    address = if add then (base + imm32) else (base - imm32);
    data = MemU[address,4];
    R[t] = data;
    

    The offset (immediate) encoded in the instruction is the number of words relative to the pc. The pc is "two ahead", that is, the address of the instruction plus 4, so for the ldr r0 instruction

       0:   4801        ldr r0, [pc, #4]    ; (8 <.text+0x8>)
       2:   bf00        nop
       4:   e7fe        b.n 4 <.text+0x4>  <--- pc is here
       6:   0000        .short  0x0000
       8:   12345678    .word   0x12345678
    

    8 - 4 = 4; 4>>2 = 1, so the literal is 1 word away from the pc. The instruction is 0x48xx, where xx is the word offset, so 0x4801 indicates one word. Here again you see why the alignment is needed to use this instruction.

    So what if we

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    nop
    ldr r0,=0x12345678
    b .
    
    
    00000000 <.text>:
       0:   bf00        nop
       2:   4801        ldr r0, [pc, #4]    ; (8 <.text+0x8>)
       4:   e7fe        b.n 4 <.text+0x4>
       6:   0000        .short  0x0000
       8:   12345678    .word   0x12345678
    

    that seems broken

    Operation
    
    if ConditionPassed() then
      EncodingSpecificOperations();
      base = Align(PC,4);
      address = if add then (base + imm32) else (base - imm32);
      data = MemU[address,4];
      if t == 15 then
        if address<1:0> == '00' then LoadWritePC(data); else UNPREDICTABLE;
      else
        R[t] = data;
    

    When you see all of the pseudo code it makes sense: the pc is 6 in this case, and the base is Align(PC,4), not the pc itself.

    Continue reading the documentation to understand the pseudo code:

    Calculate the PC or Align(PC,4) value of the instruction. The PC value of an instruction is its address plus 4 for a Thumb instruction. The Align(PC,4) value of an instruction is its PC value ANDed with 0xFFFFFFFC to force it to be word-aligned.

    so 0x6 & 0xFFFFFFFC = 4. 8 - 4 = 4; 4>>2 = 1; so 0x4801.
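
    The same arithmetic as a Python sketch, so you can check any of the ldr lines in the listings. It follows the T1 pseudo code quoted above (base is Align(PC,4), offset is a multiple of four from 0 to 1020, stored as a word count in the low byte); the 0x4800 base pattern with the register in bits 10:8 is read off the disassembly, and the function name is made up.

```python
def encode_ldr_literal_t1(rt: int, insn_addr: int, literal_addr: int) -> int:
    """Encode the 16-bit 'ldr rt, [pc, #imm]' for a literal at literal_addr."""
    if not 0 <= rt <= 7:
        raise ValueError("the 16-bit encoding only reaches r0-r7")
    base = (insn_addr + 4) & 0xFFFFFFFC      # Align(PC, 4), PC = address + 4
    offset = literal_addr - base
    if offset % 4 or not 0 <= offset <= 1020:
        raise ValueError("literal not reachable or not word aligned")
    return 0x4800 | (rt << 8) | (offset >> 2)


# ldr at address 0 and ldr at address 2, literal at 8 in both cases.
print(hex(encode_ldr_literal_t1(0, 0x0, 0x8)))
print(hex(encode_ldr_literal_t1(0, 0x2, 0x8)))
```

    Both give 0x4801, which is why moving the nop in front of the ldr did not change the encoding.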

    If we force the thumb2 instruction

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    ldr.w r0,=0x12345678
    b .
    

    The assembler still aligns the literal, probably to save us from faults, even though the thumb2 version can reach offsets that are not multiples of four

    00000000 <.text>:
       0:   f8df 0004   ldr.w   r0, [pc, #4]    ; 8 <.text+0x8>
       4:   e7fe        b.n 4 <.text+0x4>
       6:   0000        .short  0x0000
       8:   12345678    .word   0x12345678
    

    Note the 4 at the end of the instruction; the literal is at pc + 4. But what if we tried to do this:

    .cpu cortex-m3
    .thumb
    .syntax unified
    
    ldr.w r0,something
    b .
    something: .word 0x12345678