Bare-metal ARM str instruction behavior in QEMU

Description

I'm trying to build a bare-metal application (no OS, no bootloader) and run it in QEMU, but I'm seeing some weird behavior: the str instruction doesn't seem to do anything.

For some context I just want to inject my program directly into RAM and run it. I'm using a modified bare-metal linker and startup.S as an example for laying out the memory and setting up the C environment. I don't really care about which ARM platform I'm using so I used the same one from their example, the vexpress-a9 with the cortex-a9 processor.

I modified the start-up file so that execution starts directly at the reset exception vector at 0x0 (which I'm treating as ROM, even though I know it's not). The idea is that the .text section gets put here, some set-up code initializes .data, .bss and the stacks, and then I branch to main.

MEMORY
{
    ROM (rx) : ORIGIN = 0x00000000, LENGTH = 1M
    RAM (rwx): ORIGIN = 0x00400000, LENGTH = 4M
}

This actually works in that I can start QEMU, attach a gdb session, and step through the initialization code, but for the set-up that should happen in "RAM" (starting at 0x00400000) nothing gets initialized at all.

For the bit of assembly below, the idea is that I want to fill the FIQ stack section with 0xFEFEFEFE. So I set r1 to the start of the stack, sp to the end, and while r1 < sp I populate the address contained within r1 with the value in r0 and increment the address in r1 by 4 bytes.

Reset_Handler:
    /* FIQ stack */
    msr cpsr_c, MODE_FIQ
    ldr r1, =_fiq_stack_start
    ldr sp, =_fiq_stack_end
    movw r0, #0xFEFE
    movt r0, #0xFEFE

fiq_loop:
    cmp r1, sp
    strlt r0, [r1], #4   <<<< ISSUE HERE
    blt fiq_loop

This does loop correctly for the right number of iterations (the size of the stack), but nothing is happening for the strlt r0, [r1], #4 instruction.
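
As a point of comparison, the intended effect of the loop can be sketched in C. This is only a host-side illustration: fill_words is a hypothetical name that does not appear in the startup code, and an ordinary buffer would stand in for the region between _fiq_stack_start and _fiq_stack_end.

```c
#include <stdint.h>

/*
 * Fill the half-open range [start, end) with a 32-bit pattern.
 * Roughly mirrors:  cmp r1, sp / strlt r0, [r1], #4 / blt fiq_loop
 *
 * Note: the assembly uses a signed comparison (lt), while C compares
 * pointers as unsigned addresses. For addresses below 0x80000000 the
 * two give the same result.
 */
void fill_words(uint32_t *start, uint32_t *end, uint32_t pattern)
{
    for (uint32_t *p = start; p < end; ++p)
        *p = pattern;   /* store, then advance by one word (4 bytes) */
}
```

Calling fill_words on a 4096-byte buffer with 0xFEFEFEFE should leave every word in the buffer set to that pattern, which is what I expected to see at 0x400008 after the first str.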

If I inspect before the str instruction, r1 is the start of the stack and the value is 0x0:

>>> p/x $r1
$2 = 0x400008
>>> x/2hx $r1
0x400008:       0x0000  0x0000

After I step over the str instruction, r1 has moved 4 bytes, but the memory at the start of the stack is still 0x0:

>>> p/x $r1
$3 = 0x40000c
>>> x/2hx 0x400008
0x400008:       0x0000  0x0000

The memory doesn't get updated, but I can directly set values there so I know that it can be updated:

>>> set *(0x400008)=0x12345678
>>> x/2hx 0x400008
0x400008:       0x5678  0x1234

I'm starting qemu with:

    qemu-system-arm \
        -nographic \
        -s \
        -S \
        --no-reboot \
        -machine vexpress-a9 \
        -cpu cortex-a9 \
        -m 12M \
        -device loader,file=out.elf

I've compiled with the -mcpu=cortex-a9 option, and I believe I've provided QEMU with enough RAM. I'm really lost as to what's happening here; any help is appreciated.

Further Debugging

Per request, I've also added clarification on the state of the following entities:

What is the value of _fiq_stack_start?

0x00400008 <- This is what I expect, as the FIQ stack should start right after the .data section, which holds 8 bytes.

What is the value of _fiq_stack_end?

0x00401008 <- This is what I expect, as I specified the stack to be 4096 bytes.

What are the contents of r1 at the moment of the cmp instruction?

r1 = 0x00400008 <- This is what I expect, as r1 should contain the start of the stack.

What are the contents of the sp register?

0x00401008 <- This is what I expect, as this should be the end of the stack.

What are the condition code bits at the moment the strlt starts?

Before the compare, CPSR = 0x40000111; after the compare, CPSR = 0x80000111. This is expected because the value in r1 is less than the value of sp, so the subtraction r1 - sp performed by cmp gives a negative result, which sets the N flag (bit 31).

What are the contents of r0?

0xfefefefe <- This is what I expect based on the movw and movt instructions, which fill r0 with the value I want in the stack.

What happens if you change the strlt to str?

I actually tested this already, and I got the same behavior.
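
The CPSR values reported above for the cmp can be cross-checked with a small host-side C model of how a compare sets the N, Z, C and V flags. This is a sketch for illustration only: arm_cmp_flags and arm_cond_lt are hypothetical helper names, and the model covers just the flag bits (CPSR bits 31 to 28), not the mode or mask bits.

```c
#include <stdint.h>

/*
 * Model of "cmp rn, op2": compute rn - op2 and return the NZCV flags
 * in the positions they occupy in the CPSR (bits 31..28).
 */
uint32_t arm_cmp_flags(uint32_t rn, uint32_t op2)
{
    uint32_t result = rn - op2;
    uint32_t n = result >> 31;                        /* result is negative  */
    uint32_t z = (result == 0);                       /* result is zero      */
    uint32_t c = (rn >= op2);                         /* set when no borrow  */
    uint32_t v = ((rn ^ op2) & (rn ^ result)) >> 31;  /* signed overflow     */
    return (n << 31) | (z << 30) | (c << 29) | (v << 28);
}

/* The LT condition passes when N and V differ. */
int arm_cond_lt(uint32_t flags)
{
    return ((flags >> 31) & 1u) != ((flags >> 28) & 1u);
}
```

With rn = 0x00400008 and op2 = 0x00401008, the model gives flags of 0x80000000 (only N set), which agrees with the top bits of the observed CPSR value 0x80000111, and the LT condition passes. So the condition on strlt is satisfied, and the condition code is not the reason the store has no visible effect.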

I've also tried these simple instructions:

    mov r0, #0x1
    mov r3, #0x2
    str r0, [r3] /* Store value of R0 into addr at r3 */

After stepping over each instruction, I would expect the value 0x1 held in r0 to be stored at the memory address 0x2 held in r3. But after inspection, it isn't:

>>> p/x $r0
$7 = 0x1

>>> p/x $r3
$8 = 0x2

>>> x/2hx $r3
0x2 <_Reset+2>: 0xea00  0x0041

It's as if the str instruction is completely ignored.

Solution

"When I try to store to memory it's as if nothing happens" almost always means "I'm trying to store to somewhere where there isn't actually RAM". Sometimes this is "nothing's there", sometimes this is "there's flash or ROM there so the write is ignored". The root cause is usually "program linked to the wrong addresses".

These addresses:

ROM (rx) : ORIGIN = 0x00000000, LENGTH = 1M
RAM (rwx): ORIGIN = 0x00400000, LENGTH = 4M

don't match what QEMU models for the vexpress-a9. The first is OK (address 0 is the remappable area, which QEMU models as "always the flash memory, not guest runtime configurable"), but there is no RAM at 0x00400000 -- this address is inside the flash memory so it is not writeable like RAM.

You should use a linker map which puts the RAM area in what the memory map calls "Local DDR2", which starts at 0x6000_0000. For example, a region such as RAM (rwx): ORIGIN = 0x60000000, LENGTH = 4M would fit inside the 12M of RAM you request with -m 12M. The linker script in the tutorial you started from gets this (sort of) right because it uses 0x6000_0000 and 0x7000_0000 -- though note that this only works because the documentation in that tutorial says to use -m 512M, which provides enough RAM to get up to the 0x7000_0000 memory range.

The reason your linker script happens to work on the xilinx-zynq-a9 board is that that board puts its block of RAM at address 0.

The overall thing here that I think is important to understand is that when you're writing a linker script you don't get to choose the addresses arbitrarily. The linker script must be written to match the address map of the board you're going to run the resulting bare-metal binary on, and then that binary won't run on a different board type.
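
One way to make that constraint concrete is a small host-side check that a linker region lies entirely inside the RAM the board actually provides. The sketch below is illustrative only: region_in_ram is a hypothetical helper, the RAM base of 0x60000000 is the "Local DDR2" address mentioned above, and the RAM size is assumed to be whatever is passed to QEMU with -m.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [origin, origin + length) lies entirely inside
 * [ram_base, ram_base + ram_size). Written to avoid overflow in
 * the additions.
 */
bool region_in_ram(uint64_t origin, uint64_t length,
                   uint64_t ram_base, uint64_t ram_size)
{
    return origin >= ram_base
        && length <= ram_size
        && origin - ram_base <= ram_size - length;
}
```

For the numbers in the question (RAM region at 0x00400000, 4M long, checked against 12M of RAM based at 0x60000000) this returns false, while a 4M region at 0x60000000 passes. A region at 0x70000000 passes only if the RAM size is large enough to reach it, as with -m 512M.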

(Incidentally, there is a QEMU bug on this board where we try to map both RAM and flash at address 0 -- I think effectively the flash "wins", so the effect is that the low address area is flash.)