Tags: assembly, arm, embedded, cortex-m

Starting program from Reset_Handler without C code


I want to be able to run and debug a binary generated from pure assembly on an ARM Cortex-M4 microcontroller without having to use inline assembly inside a C program.

I have a linker script and some utility C startup code that sets up the interrupt vector table, implements the Reset_Handler function, copies the .data section from flash to SRAM, zeroes .bss, and then calls main(). This workflow works, but it is a bit clunky: I would rather write assembly directly than inline it in a C program that is nothing more than main() wrapping the assembly mnemonics. I am also curious whether there is a better way of going about this altogether. The Reset_Handler function looks like this:

void Reset_Handler(void)
{
        //copy .data section to SRAM
        uint32_t size = (uint32_t)&_edata - (uint32_t)&_sdata;

        uint8_t *pDst = (uint8_t*)&_sdata; //sram
        uint8_t *pSrc = (uint8_t*)&_la_data; //flash

        for(uint32_t i =0 ; i < size ; i++)
        {
                *pDst++ = *pSrc++;
        }

        //Init. the .bss section to zero in SRAM
        size = (uint32_t)&_ebss - (uint32_t)&_sbss;
        pDst = (uint8_t*)&_sbss;
        for(uint32_t i =0 ; i < size ; i++)
        {
                *pDst++ = 0;
        }

        __libc_init_array();

        main();

}

EDIT: The details of my toolchain and linker script are included below.

  • Board: STM32F407VG with Cortex-M4.
  • OpenOCD and GDB for debugging
  • vim for code editor (my purpose is to work on baremetal without any IDE-provided startup or linker code).
  • arm-none-eabi-gcc for compiling and linking

I am trying to follow along with this tutorial, but instead of running the code in a VM, my goal is to run and debug directly on the board.

My linker script:


ENTRY(Reset_Handler)

MEMORY
{
  FLASH(rx):ORIGIN =0x08000000,LENGTH =1024K
  SRAM(rwx):ORIGIN =0x20000000,LENGTH =128K
}


SECTIONS
{
  .text :
  {
    *(.isr_vector)
    *(.text)
    *(.rodata)
    . = ALIGN(4);
    _etext = .;
  }> FLASH
  
  _la_data = LOADADDR(.data);
  
  .data :
  {
    _sdata = .;
    *(.data)
    *(.data.*)
    . = ALIGN(4);
    _edata = .;
  }> SRAM AT> FLASH
  
  .bss :
  {
    _sbss = .;
    __bss_start__ = _sbss;
    *(.bss)
    *(.bss.*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
    __bss_end__ = _ebss;
    . = ALIGN(4);
    end = .;
    __end__ = .;
  }> SRAM

  
}

Solution

  • Your linker script

    ENTRY(Reset_Handler)
    MEMORY
    {
      FLASH(rx):ORIGIN =0x08000000,LENGTH =1024K
      SRAM(rwx):ORIGIN =0x20000000,LENGTH =128K
    }
    SECTIONS
    {
      .text :
      {
        *(.isr_vector)
        *(.text)
        *(.rodata)
        . = ALIGN(4);
        _etext = .;
      }> FLASH
      _la_data = LOADADDR(.data);
      .data :
      {
        _sdata = .;
        *(.data)
        *(.data.*)
        . = ALIGN(4);
        _edata = .;
      }> SRAM AT> FLASH
      .bss :
      {
        _sbss = .;
        __bss_start__ = _sbss;
        *(.bss)
        *(.bss.*)
        *(COMMON)
        . = ALIGN(4);
        _ebss = .;
        __bss_end__ = _ebss;
        . = ALIGN(4);
        end = .;
        __end__ = .;
      }> SRAM
    }
    

    As the ARM and ST documentation describes, the vector table starts with the initial stack pointer value, followed by the reset vector and then the other vectors, of which there can be hundreds depending on the chip. The chip vendor maps the application flash at 0x08000000, and with certain boot options it is mirrored to 0x00000000, where the ARM core expects to find the vector table at boot. RAM starts at 0x20000000, and its size depends on the chip.

    .cpu cortex-m4
    
    .word 0x20001000
    .word Reset_Handler
    .word loop
    .word loop
    
    .globl Reset_Handler
    .thumb_func
    Reset_Handler:
        b loop
    
    .thumb_func
    loop:
        b .
    
    .align
    .word 0x11223344
    .word _edata
    .word _sdata
    .word _la_data
    .word _ebss
    .word _sbss
    .word 0x55667788
    

    This is not a bad starting point. As the linker documentation describes, the linker script can define symbols, which you can then reference in your code. The C code uses them, and they are just as available from assembly.

    Build it:

    arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m4 so.s -o so.o
    arm-none-eabi-ld -nostdlib -nostartfiles -T so.ld so.o -o so.elf
    arm-none-eabi-objdump -D so.elf > so.list
    arm-none-eabi-objcopy -O binary so.elf so.bin
    arm-none-eabi-objcopy -O srec --srec-forceS3 so.elf so.srec
    

    Examine the dump:

    Disassembly of section .text:
    
    08000000 <Reset_Handler-0x10>:
     8000000:   20001000    andcs   r1, r0, r0
     8000004:   08000011    stmdaeq r0, {r0, r4}
     8000008:   08000013    stmdaeq r0, {r0, r1, r4}
     800000c:   08000013    stmdaeq r0, {r0, r1, r4}
    
    08000010 <Reset_Handler>:
     8000010:   e7ff        b.n 8000012 <loop>
    
    08000012 <loop>:
     8000012:   e7fe        b.n 8000012 <loop>
     8000014:   11223344            ; <UNDEFINED> instruction: 0x11223344
     8000018:   20000000    andcs   r0, r0, r0
     800001c:   20000000    andcs   r0, r0, r0
     8000020:   08000030    stmdaeq r0, {r4, r5}
     8000024:   20000000    andcs   r0, r0, r0
     8000028:   20000000    andcs   r0, r0, r0
     800002c:   55667788    strbpl  r7, [r6, #-1928]!   ; 0xfffff878
    

    The disassembler tries to decode everything as instructions, including the data words, so read it like this instead:

    08000000 <Reset_Handler-0x10>:
     8000000:   20001000   sp initialization value
     8000004:   08000011   reset handler address orred with one (see the docs)
     8000008:   08000013   some other handler
     800000c:   08000013   some other handler
    
    
     8000014:   11223344   .word 0x11223344
     8000018:   20000000   .word _edata
     800001c:   20000000   .word _sdata
     8000020:   08000030   .word _la_data
     8000024:   20000000   .word _ebss
     8000028:   20000000   .word _sbss
     800002c:   55667788   .word 0x55667788
    

    There is no .data, so _edata and _sdata are at the same address. _la_data looks a little strange (it is the load address in flash where .data would be stored), and there is no .bss either, so its start and end are also at the same address. So add some:

    .cpu cortex-m4
    
    .word 0x20001000
    .word Reset_Handler
    .word loop
    .word loop
    
    .globl Reset_Handler
    .thumb_func
    Reset_Handler:
        b loop
    
    .thumb_func
    loop:
        b .
    
    .align
    .word 0x11223344
    .word _edata
    .word _sdata
    .word _la_data
    .word _ebss
    .word _sbss
    .word 0x55667788
    
    .section .bss
    .byte 0
    
    .section .data
    .byte 0x66
    
    
    Disassembly of section .text:
    
    08000000 <Reset_Handler-0x10>:
     8000000:   20001000    andcs   r1, r0, r0
     8000004:   08000011    stmdaeq r0, {r0, r4}
     8000008:   08000013    stmdaeq r0, {r0, r1, r4}
     800000c:   08000013    stmdaeq r0, {r0, r1, r4}
    
    08000010 <Reset_Handler>:
     8000010:   e7ff        b.n 8000012 <loop>
    
    08000012 <loop>:
     8000012:   e7fe        b.n 8000012 <loop>
     8000014:   11223344            ; <UNDEFINED> instruction: 0x11223344
     8000018:   20000004    andcs   r0, r0, r4
     800001c:   20000000    andcs   r0, r0, r0
     8000020:   08000030    stmdaeq r0, {r4, r5}
     8000024:   20000008    andcs   r0, r0, r8
     8000028:   20000004    andcs   r0, r0, r4
     800002c:   55667788    strbpl  r7, [r6, #-1928]!   ; 0xfffff878
    
    Disassembly of section .data:
    
    20000000 <_sdata>:
    20000000:   00000066    andeq   r0, r0, r6, rrx
    
    Disassembly of section .bss:
    
    20000004 <__bss_start__>:
    20000004:   00000000    andeq   r0, r0, r0
    
     8000018:   20000004    andcs   r0, r0, r4
     800001c:   20000000    andcs   r0, r0, r0
     8000020:   08000030    stmdaeq r0, {r4, r5}
     8000024:   20000008    andcs   r0, r0, r8
     8000028:   20000004    andcs   r0, r0, r4
    

    So .data occupies 0x20000000 through 0x20000003, and .bss occupies 0x20000004 through 0x20000007.
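
A one-byte .data still ends at 0x20000004 because of the `. = ALIGN(4);` line in the linker script. Here is a rough sketch of that arithmetic in C. The helper names `align_up` and `section_end` are made up for illustration; they are not linker or library functions.

```c
#include <stdint.h>

/* Round addr up to the next multiple of align (align must be a power of two).
   This mirrors what `. = ALIGN(4);` does to the linker's location counter. */
static uint32_t align_up(uint32_t addr, uint32_t align)
{
    return (addr + align - 1u) & ~(align - 1u);
}

/* Hypothetical helper: end address of an output section that starts at
   `start`, holds `bytes` bytes of content, and is padded with ALIGN(align). */
static uint32_t section_end(uint32_t start, uint32_t bytes, uint32_t align)
{
    return align_up(start + bytes, align);
}
```

With the one-byte .data at 0x20000000 and the one-byte .bss after it, `section_end` gives 0x20000004 and 0x20000008, the same values as _edata and _ebss in the listing.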

    S00A0000736F2E7372656338
    S315080000000010002011000008130000081300000863
    S31508000010FFE7FEE744332211040000200000002019
    S315080000203000000808000020040000208877665584
    S309080000306600000058
    S70508000011E1
    

    At address 0x08000030 (the last S3 record) we can see the .data value, 0x66.
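
If you want to check an S-record line without doing it by eye, something like the following C sketch decodes one S3 record (32-bit address) and verifies its checksum. The struct and function names are invented for this example, and it assumes at most 64 data bytes per record.

```c
#include <stdint.h>
#include <string.h>

/* Decoded S3 record: 32-bit load address plus payload bytes. */
struct s3_record {
    uint32_t address;
    uint8_t  data[64];   /* assumption: no more than 64 data bytes per line */
    unsigned length;     /* number of payload bytes */
};

/* One hex character to its value, or -1 if it is not a hex digit. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Two hex characters to a byte, or -1 on a non-hex character. */
static int hex_byte(const char *s)
{
    int hi = hex_digit(s[0]);
    int lo = (hi < 0) ? -1 : hex_digit(s[1]);
    return (hi < 0 || lo < 0) ? -1 : ((hi << 4) | lo);
}

/* Parse one S3 line. Returns 0 on success, -1 on bad format or checksum. */
static int parse_s3(const char *line, struct s3_record *out)
{
    uint8_t bytes[4 + 64 + 1];   /* address + data + checksum */
    unsigned sum;
    int count, i;

    if (line[0] != 'S' || line[1] != '3') return -1;
    count = hex_byte(line + 2);  /* number of bytes that follow the count */
    if (count < 5 || count - 5 > (int)sizeof out->data) return -1;
    if (strlen(line) < 4u + 2u * (size_t)count) return -1;

    sum = (unsigned)count;
    for (i = 0; i < count; i++) {
        int b = hex_byte(line + 4 + 2 * i);
        if (b < 0) return -1;
        bytes[i] = (uint8_t)b;
        sum += (unsigned)b;
    }
    /* count + address + data + checksum must end in 0xFF */
    if ((sum & 0xFFu) != 0xFFu) return -1;

    out->address = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16) |
                   ((uint32_t)bytes[2] << 8)  |  (uint32_t)bytes[3];
    out->length = (unsigned)count - 5u;
    memcpy(out->data, bytes + 4, out->length);
    return 0;
}
```

Feeding it the line `S309080000306600000058` from the dump yields address 0x08000030 and four data bytes starting with 0x66, which is the .data image sitting in flash at _la_data.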

    So you can simply rewrite the C code in assembly language (this analysis was not required, but it is good to do). If you do not put alignment into the linker script, then you have to do a byte-by-byte copy like the C code. Alternatively, if you are willing to write the extra code, you can try something faster, but both ends need to be misaligned in the same way.

    At a minimum, the things your bootstrap needs to do for an MCU like this are:

    1) stack pointer
    2) .data
    3) .bss
    4) call/branch to C entry point
    5) infinite loop
    

    Many folks will say you should never return from main(), but

    1) you can protect them anyway, and they will thank you later
    2) they perhaps have not created a purely event driven solution
    

    It does not hurt. As the ARM documentation describes, the core has a mechanism for loading the stack pointer at reset. If you use that, it checks the first box.
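
As a rough model of that mechanism: at reset a Cortex-M core takes the first word of the vector table as the stack pointer and the second word as the reset address, with bit 0 set to indicate Thumb state. The C sketch below just mimics that on a host machine. `struct reset_state` and `fetch_reset_state` are names made up for illustration, not anything from CMSIS or the toolchain.

```c
#include <stdint.h>

/* Core state right after reset, as far as this sketch models it. */
struct reset_state {
    uint32_t sp;     /* main stack pointer, taken from vector table word 0 */
    uint32_t pc;     /* address of the first instruction executed */
    int      thumb;  /* 1 when bit 0 of the reset vector was set */
};

/* Model of what the hardware does with the first two vector table words. */
static struct reset_state fetch_reset_state(const uint32_t *vectors)
{
    struct reset_state s;
    s.sp    = vectors[0];
    s.thumb = (int)(vectors[1] & 1u);
    s.pc    = vectors[1] & ~1u;   /* strip the Thumb bit to get the real address */
    return s;
}
```

With the four words from the listing (0x20001000, 0x08000011, 0x08000013, 0x08000013) this gives sp = 0x20001000 and pc = 0x08000010, which is where Reset_Handler sits. That is why the bootstrap below does not load sp itself.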

    Not intended to be lean and mean, wholly untested, maybe buggy:

    .cpu cortex-m4
    .syntax unified
    
    .word 0x20001000
    .word Reset_Handler
    .word loop
    .word loop
    
    .globl Reset_Handler
    .thumb_func
    Reset_Handler:
        /*copy .data section to SRAM */
        /*uint32_t size = (uint32_t)&_edata - (uint32_t)&_sdata;*/
        ldr r0,=_edata
        ldr r1,=_sdata
        subs r0,r0,r1
        bne data_loop_done
    
        /*uint8_t *pDst = (uint8_t*)&_sdata; //sram*/
        /*uint8_t *pSrc = (uint8_t*)&_la_data; //flash*/
    
        ldr r2,=_la_data
    
        /*
        for(uint32_t i =0 ; i < size ; i++)
        {
                *pDst++ = *pSrc++;
        }
        */
    
    data_loop:
        ldrb r3,[r2]
        adds r2,#1
        strb r3,[r1]
        adds r1,#1
        subs r0,r0,#1
        bne data_loop
    data_loop_done:
    
        /*
        Init. the .bss section to zero in SRAM
        size = (uint32_t)&_ebss - (uint32_t)&_sbss;
        pDst = (uint8_t*)&_sbss;
        for(uint32_t i =0 ; i < size ; i++)
        {
                *pDst++ = 0;
        }
        */
    
        ldr r0,=_ebss
        ldr r1,=_sbss
        mov r2,#0
        subs r0,r0,r1
        bne bss_loop_done
    bss_loop:
        strb r2,[r1]
        adds r1,#1
        bne bss_loop
    bss_loop_done:
    
        /*__libc_init_array();*/
        bl __libc_init_array
    
        /*main();*/
        bl main
    
        b loop
    
    .thumb_func
    loop:
        b .
    
    __libc_init_array:
        bx lr
    
    main:
        bx lr
    
    .align
    .word 0x11223344
    .word _edata
    .word _sdata
    .word _la_data
    .word _ebss
    .word _sbss
    .word 0x55667788
    
    .section .bss
    .byte 0
    
    .section .data
    .byte 0x66
    

    But it does assemble and link:

    08000010 <Reset_Handler>:
     8000010:   4814        ldr r0, [pc, #80]   ; (8000064 <main+0x1e>)
     8000012:   4915        ldr r1, [pc, #84]   ; (8000068 <main+0x22>)
     8000014:   1a40        subs    r0, r0, r1
     8000016:   d106        bne.n   8000026 <data_loop_done>
     8000018:   4a14        ldr r2, [pc, #80]   ; (800006c <main+0x26>)
    
    0800001a <data_loop>:
     800001a:   7813        ldrb    r3, [r2, #0]
     800001c:   3201        adds    r2, #1
     800001e:   700b        strb    r3, [r1, #0]
     8000020:   3101        adds    r1, #1
     8000022:   3801        subs    r0, #1
     8000024:   d1f9        bne.n   800001a <data_loop>
    
    08000026 <data_loop_done>:
    ...
     8000064:   20000004    andcs   r0, r0, r4
     8000068:   20000000    andcs   r0, r0, r0
     800006c:   08000078    stmdaeq r0, {r3, r4, r5, r6}
    

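    If you are careful, you can avoid forcing Thumb-2 encodings where they are not necessary. Thumb-2 instructions may let you improve on this, but if the linker script does its job (aligns the section boundaries), you can use ldr/str and move a word at a time, possibly comparing against the end address rather than counting down a size. Whichever you prefer.

    Hmm, yeah, I did leave an instruction out of the above code: the byte-wise bss_loop never decrements r0, so it does not stop when the count runs out. The two branches that skip the loops are also inverted: they should be beq (skip when the size is zero), not bne. Here is the .bss loop again, this time comparing against the end address and storing a word at a time: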

        ldr r0,=_ebss
        ldr r1,=_sbss
        mov r2,#0
        cmp r0,r1
        beq bss_loop_done
    bss_loop:
        str r2,[r1]
        adds r1,#4
        cmp r0,r1
        bne bss_loop
    bss_loop_done:
    

    This should be four or more times faster, depending on the system (chip). BUT you have to ensure that the start and end addresses are word aligned. You can go further by increasing the alignment to a double-word boundary (which means ALIGN(8) instead of ALIGN(4) in the linker script):

        ldr r0,=_ebss
        ldr r1,=_sbss
        mov r2,#0
        mov r3,#0
        cmp r0,r1
        beq bss_loop_done
    bss_loop:
        stm r1!,{r2,r3}
        cmp r0,r1
        bne bss_loop
    bss_loop_done:
    

    The word-at-a-time loop could have used stm as well and saved an instruction. You might see a gain with four words at a time, but you might not on a Cortex-M. Two words is a nice balance. You can apply the same optimizations to the .data copy.
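
For comparison, here is the word-at-a-time idea written as C, for both the .data copy and the .bss fill. It is only a sketch: `copy_words` and `zero_words` are hypothetical names, and like the assembly they assume the start and end addresses are 4-byte aligned, since the loops stop on pointer equality.

```c
#include <stdint.h>

/* Copy 32-bit words from src into [dst, dst_end).
   Assumes dst and dst_end are word aligned, so dst lands exactly on dst_end. */
static void copy_words(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst != dst_end) {
        *dst++ = *src++;
    }
}

/* Zero the 32-bit words in [dst, dst_end), same alignment assumption. */
static void zero_words(uint32_t *dst, const uint32_t *dst_end)
{
    while (dst != dst_end) {
        *dst++ = 0u;
    }
}
```

On the target the calls would look something like `copy_words((uint32_t *)&_sdata, (uint32_t *)&_edata, (uint32_t *)&_la_data)` and `zero_words((uint32_t *)&_sbss, (uint32_t *)&_ebss)`, using the linker symbols from the question. An empty section costs nothing because the loop condition is checked before the first store.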

    I hope this was not a homework assignment. If it was, you still get to find and fix the bugs. But it is a simple matter of reading and porting the code, and of looking at the endless supply of examples out there.

    Looking at the linker script again, it was designed for a separate .isr_vector section:

    .cpu cortex-m4
    .syntax unified
    
    .section .isr_vector
    
    .word 0x20001000
    .word Reset_Handler
    .word loop
    .word loop
    
    .section .text
    
    .globl Reset_Handler
    .thumb_func
    Reset_Handler:
        b loop
    
    .thumb_func
    loop:
        b .
    
    Disassembly of section .text:
    
    08000000 <Reset_Handler-0x10>:
     8000000:   20001000    andcs   r1, r0, r0
     8000004:   08000011    stmdaeq r0, {r0, r4}
     8000008:   08000013    stmdaeq r0, {r0, r1, r4}
     800000c:   08000013    stmdaeq r0, {r0, r1, r4}
    
    08000010 <Reset_Handler>:
     8000010:   e7ff        b.n 8000012 <loop>
    
    08000012 <loop>:
     8000012:   e7fe        b.n 8000012 <loop>
    

    That way you do not have to list the objects on the command line in a particular order.

    There is an intimate relationship between the linker script and the bootstrap code. You can't really have one without the other; they are a pair. You should not mix and match linker scripts and bootstrap code from different projects willy-nilly. Keep them together as designed.

    Linker scripts are not portable, and assembly language is not assumed to be portable, so IMO you should make each as simple and lean as possible: less is more, less to port, less to maintain, less toolchain-specific stuff. That is not the general view, though. Developers love to make grossly overcomplicated linker scripts. The C library can play a role here too. With the GNU model the C library is really a separate part, and you can insert whichever one you want (it comes with its related bootstrap and linker script), but that depends on how that library works, the target, etc.

    A microcontroller without an RTOS is not really C library friendly, so you have to ask yourself: do I really need a C library? How much simpler, smaller, cheaper, more readable, and more maintainable can I make this project?

    Mine tend to look like this

    .thumb_func
    reset:
        bl main
        b .
    
    MEMORY
    {
        rom : ORIGIN = 0x08000000, LENGTH = 0x1000
        ram : ORIGIN = 0x20000000, LENGTH = 0x1000
    }
    SECTIONS
    {
        .text   : { *(.text*)   } > rom
        .rodata : { *(.rodata*) } > rom
        .bss    : { *(.bss*)    } > ram
    }
    

    Each experienced person reading this will show you a different style, a different opinion, and so on. That is another feature of bare metal: the freedom to do it your own way, truly bound only by the hardware rules and nothing else. No one's solution is really wrong. It just reflects their style.