c assembly armv7 eabi

Self-written simple memset not working with -O3 (EABI GCC on ARMv7)


I wrote a very simple memset in C that works fine up to -O2 but not with -O3...

memset:

void * memset(void * blk, int c, size_t n)
{
    unsigned char * dst = blk;

    while (n-- > 0)
        *dst++ = (unsigned char)c;

    return blk;
}

...which compiles to this assembly when using -O2:

20000430 <memset>:
20000430:       e3520000        cmp     r2, #0                  @ compare param 'n' with zero
20000434:       012fff1e        bxeq    lr                      @ if equal return to caller
20000438:       e6ef1071        uxtb    r1, r1                  @ else zero extend (extract byte from) param 'c'
2000043c:       e0802002        add     r2, r0, r2              @ add pointer 'blk' to 'n'
20000440:       e1a03000        mov     r3, r0                  @ move pointer 'blk' to r3
20000444:       e4c31001        strb    r1, [r3], #1            @ store value of 'c' to address of r3, increment r3 for next pass
20000448:       e1530002        cmp     r3, r2                  @ compare current store address to calculated max address
2000044c:       1afffffc        bne     20000444 <memset+0x14>  @ if not equal store next byte
20000450:       e12fff1e        bx      lr                      @ else back to caller

This makes sense to me. I annotated what happens here.

When I compile it with -O3 the program crashes. My memset calls itself repeatedly until it has eaten the whole stack:

200005e4 <memset>:
200005e4:       e3520000        cmp     r2, #0                  @ compare param 'n' with zero
200005e8:       e92d4010        push    {r4, lr}                @ ? (1)
200005ec:       e1a04000        mov     r4, r0                  @ move pointer 'blk' to r4 (temp to hold return value)
200005f0:       0a000001        beq     200005fc <memset+0x18>  @ if equal (first line compare) jump to epilogue
200005f4:       e6ef1071        uxtb    r1, r1                  @ zero extend (extract byte from) param 'c'
200005f8:       ebfffff9        bl      200005e4 <memset>       @ call myself ? (2)
200005fc:       e1a00004        mov     r0, r4                  @ epilogue start. move return value to r0
20000600:       e8bd8010        pop     {r4, pc}                @ restore r4 and back to caller

I can't figure out how this optimised version is supposed to work without any strb or similar. It doesn't matter if I try to set the memory to '0' or something else so the function is not only called on .bss (zero initialised) variables.

(1) This is a problem. This push gets endlessly repeated without a matching pop as it's called by (2) when the function doesn't early-exit because of 'n' being zero. I verified this with uart prints. Also r2 is never touched so why should the compare to zero ever become true?

Please help me understand what's happening here. Is the compiler assuming prerequisites that I may not fulfill?

Background: I'm using external code that requires memset in my baremetal project so I rolled my own. It's only used once on startup and not performance critical.

/edit: The compiler is called with these options:

arm-none-eabi-gcc -O3 -Wall -Wextra -fPIC -nostdlib -nostartfiles -marm -fstrict-volatile-bitfields -march=armv7-a -mcpu=cortex-a9 -mfloat-abi=hard -mfpu=neon-vfpv3

Solution

  • Your first question (1): the push is required by the calling convention. A function that makes a nested call has to preserve the link register, and the stack has to stay 64-bit aligned, so a second register is pushed along with lr. The code uses r4 (to hold the return value across the call), so that is the extra register saved. No magic there.

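    Your second question (2): the compiler is not deliberately calling your memset. It is optimizing your code: it sees the loop as an inefficient memset and replaces it with a call to memset, which here happens to be your own function. Fuz has provided the answers to your question.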

    Rename the function to xmemset, for example:

    00000000 <xmemset>:
       0:   e3520000    cmp r2, #0
       4:   e92d4010    push    {r4, lr}
       8:   e1a04000    mov r4, r0
       c:   0a000001    beq 18 <xmemset+0x18>
      10:   e6ef1071    uxtb    r1, r1
      14:   ebfffffe    bl  0 <memset>
      18:   e1a00004    mov r0, r4
      1c:   e8bd8010    pop {r4, pc}
    

    and you can see that the bl now targets memset rather than the function itself.
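
    In C terms, the -O3 listing above corresponds roughly to the following. This is a hand-written sketch of what the optimizer produced, not compiler output; the comments map each statement back to the instructions in the listing.

    ```c
    #include <stddef.h>
    #include <string.h>

    /* Rough C equivalent of the -O3 listing: the byte loop is gone and
     * the work is delegated to memset. */
    void *xmemset(void *blk, int c, size_t n)
    {
        if (n != 0)                            /* cmp r2, #0 ; beq epilogue */
            memset(blk, (unsigned char)c, n);  /* uxtb r1, r1 ; bl memset   */
        return blk;                            /* mov r0, r4 ; pop {r4, pc} */
    }
    ```

    If the function is itself named memset, that inner call resolves to the function itself, which is the endless recursion from the question.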

    If you use -ffreestanding, as Fuz recommended, you see this or something like it:

    00000000 <xmemset>:
       0:   e3520000    cmp r2, #0
       4:   012fff1e    bxeq    lr
       8:   e92d41f0    push    {r4, r5, r6, r7, r8, lr}
       c:   e2426001    sub r6, r2, #1
      10:   e3560002    cmp r6, #2
      14:   e6efe071    uxtb    lr, r1
      18:   9a00002a    bls c8 <xmemset+0xc8>
      1c:   e3a0c000    mov r12, #0
      20:   e3520023    cmp r2, #35 ; 0x23
      24:   e7c7c01e    bfi r12, lr, #0, #8
      28:   e1a04122    lsr r4, r2, #2
      2c:   e7cfc41e    bfi r12, lr, #8, #8
      30:   e7d7c81e    bfi r12, lr, #16, #8
      34:   e7dfcc1e    bfi r12, lr, #24, #8
      38:   9a000024    bls d0 <xmemset+0xd0>
      3c:   e2445009    sub r5, r4, #9
      40:   e1a03000    mov r3, r0
      44:   e3c55007    bic r5, r5, #7
      48:   e3a07000    mov r7, #0
      4c:   e2851008    add r1, r5, #8
      50:   e1570005    cmp r7, r5
      54:   f5d3f0a0    pld [r3, #160]  ; 0xa0
      58:   e1a08007    mov r8, r7
      5c:   e583c000    str r12, [r3]
      60:   e583c004    str r12, [r3, #4]
      64:   e2877008    add r7, r7, #8
      68:   e583c008    str r12, [r3, #8]
      6c:   e2833020    add r3, r3, #32
      70:   e503c014    str r12, [r3, #-20] ; 0xffffffec
      74:   e503c010    str r12, [r3, #-16]
      78:   e503c00c    str r12, [r3, #-12]
      7c:   e503c008    str r12, [r3, #-8]
      80:   e503c004    str r12, [r3, #-4]
      84:   1afffff1    bne 50 <xmemset+0x50>
      88:   e2811001    add r1, r1, #1
      8c:   e483c004    str r12, [r3], #4
      90:   e1540001    cmp r4, r1
      94:   8afffffb    bhi 88 <xmemset+0x88>
      98:   e3c23003    bic r3, r2, #3
      9c:   e1520003    cmp r2, r3
      a0:   e0466003    sub r6, r6, r3
      a4:   e0803003    add r3, r0, r3
      a8:   08bd81f0    popeq   {r4, r5, r6, r7, r8, pc}
      ac:   e3560000    cmp r6, #0
      b0:   e5c3e000    strb    lr, [r3]
      b4:   08bd81f0    popeq   {r4, r5, r6, r7, r8, pc}
      b8:   e3560001    cmp r6, #1
      bc:   e5c3e001    strb    lr, [r3, #1]
      c0:   15c3e002    strbne  lr, [r3, #2]
      c4:   e8bd81f0    pop {r4, r5, r6, r7, r8, pc}
      c8:   e1a03000    mov r3, r0
      cc:   eafffff6    b   ac <xmemset+0xac>
      d0:   e1a03000    mov r3, r0
      d4:   e3a01000    mov r1, #0
      d8:   eaffffea    b   88 <xmemset+0x88>
    

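    which looks as if it simply inlined a memset, the faster one it knows rather than your byte loop: the bfi instructions replicate the byte into a 32-bit word, that word is written with str, and there is no call left.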
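
    The idea behind that long listing can be sketched in C as below. This is an illustration, not the compiler's exact code: the name memset_words is hypothetical, the may_alias typedef is a GCC/Clang extension assumed here so that the word stores are well-defined, and the sketch adds an alignment step that the listing does not appear to have.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Word type that may alias any object (GCC/Clang extension, an
     * assumption of this sketch). */
    typedef uint32_t __attribute__((__may_alias__)) word_t;

    /* Hypothetical name; shows the word-at-a-time idea only. */
    void *memset_words(void *blk, int c, size_t n)
    {
        unsigned char *dst = blk;
        unsigned char byte = (unsigned char)c;

        /* Replicate the byte into all four lanes (the bfi sequence). */
        uint32_t pattern = byte * 0x01010101u;

        /* Head: byte stores until dst is 4-byte aligned. */
        while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
            *dst++ = byte;
            n--;
        }
        /* Body: one 32-bit store per four bytes (the str instructions). */
        while (n >= 4) {
            *(word_t *)dst = pattern;
            dst += 4;
            n -= 4;
        }
        /* Tail: at most three remaining bytes (the strb instructions). */
        while (n > 0) {
            *dst++ = byte;
            n--;
        }
        return blk;
    }
    ```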

    So if you want it to use your code as written, stick with -O2. Your loop is pretty inefficient anyway, so I am not sure why you would need to push the optimization level any further.

    20000444:       e4c31001        strb    r1, [r3], #1            @ store value of 'c' to address of r3, increment r3 for next pass
    20000448:       e1530002        cmp     r3, r2                  @ compare current store address to calculated max address
    2000044c:       1afffffc        bne     20000444 <memset+0x14>  @ if not equal store next byte
    

    It isn't going to get any better than that without replacing your code with something else.

    Fuz already answered the question:

    Compile with -fno-builtin-memset. The compiler recognises that the function implements memset and thus replaces it with a call to memset. You should in general compile with -ffreestanding when writing bare-metal code. I believe this fixes this sort of problem, too

    It is replacing your code with a call to memset; if you do not want that, use -ffreestanding.
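
    If changing the compiler flags is not an option, one source-level alternative is to make the stores volatile. This is not from Fuz's answer, so treat it as an assumption: the sketch relies on the compiler not merging or replacing volatile accesses, which is how GCC behaves in practice. The name my_memset is hypothetical so the code can be tried next to a libc; on the bare-metal target it would be named memset.

    ```c
    #include <stddef.h>

    /* Hypothetical name; see the note above. */
    void *my_memset(void *blk, int c, size_t n)
    {
        /* volatile: every byte store has to be emitted as written, so the
         * loop is not a candidate for being collapsed into a library call. */
        volatile unsigned char *dst = blk;

        while (n-- > 0)
            *dst++ = (unsigned char)c;

        return blk;
    }
    ```

    The cost is that the compiler also cannot widen or unroll the stores, which matches the "only used once on startup" use case from the question.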

    If you wish to go beyond that and wonder why -fno-builtin-memset didn't work, that is a question for the gcc folks: file a ticket and let us know what they say (or just look at the compiler source code).