
ARM-GCC not optimizing array access in structs?


I have the following minimal example code:

typedef struct {
    unsigned long a[16];
    unsigned long b[16];
} myStruct_t;

myStruct_t myStruct;

unsigned long getA(unsigned int index) 
{    
    return myStruct.a[index];
}

unsigned long getB(unsigned int index) 
{    
    return myStruct.b[index];
}

Compiler explorer: https://godbolt.org/z/d6WavW

This is compiled with ARM GCC 8.2 at -O2 (-O1 and -O3 result in the same asm, and other GCC versions seem to produce comparable results as well).

Interestingly, GCC calculates the offset of b at runtime. Since the address of b is constant, I would have expected it to be pre-calculated, resulting in the same code size for getA and getB:

getA:
        ldr     r3, .L3
        ldr     r0, [r3, r0, lsl #2]
        bx      lr
.L3:
        .word   myStruct
getB:
        ldr     r3, .L6
        add     r0, r0, #16
        ldr     r0, [r3, r0, lsl #2]
        bx      lr
.L6:
        .word   myStruct

Why does GCC omit this optimization?

What would be best coding practice? Do not put arrays in structs when speed is important?


Solution

  • This is just a missed optimization, probably related to the non-target-specific areas of the compiler. It is not limited to Cortex-M: it is present on full-sized ARM and RISC-V, and no doubt on other (fixed-ish-length) instruction sets.

    00000000 <getA>:
       0:   00251793            slli    x15,x10,0x2
       4:   00000537            lui x10,0x0
       8:   00050513            addi    x10,x10,0 # 0 <getA>
       c:   953e                    c.add   x10,x15
       e:   4108                    c.lw    x10,0(x10)
      10:   8082                    c.jr    x1
    
    00000012 <getB>:
      12:   0541                    c.addi  x10,16
      14:   00251793            slli    x15,x10,0x2
      18:   00000537            lui x10,0x0
      1c:   00050513            addi    x10,x10,0 # 0 <getA>
      20:   953e                    c.add   x10,x15
      22:   4108                    c.lw    x10,0(x10)
      24:   8082                    c.jr    x1
    

    Yes, gcc and clang do the same thing (I am using gcc 10.x here):

    getA:
        ldr r3, .L3
        ldr r0, [r3, r0, lsl #2]
        bx  lr
    .L4:
        .align  2
    .L3:
        .word   .LANCHOR0
        .size   getA, .-getA
    
    getB:
        ldr r3, .L6
        add r0, r0, #16
        ldr r0, [r3, r0, lsl #2]
        bx  lr
    .L7:
        .align  2
    .L6:
        .word   .LANCHOR0
    

    So two things stand out here. Granted, the functions are built and optimized independently of each other. The base address of the struct is filled in by the linker; in this case both functions use the same address, and they are small enough that both could have reached the same pool entry. But that would be a risky optimization for the tool to attempt, given the number of passes involved. Also, gcc at least generates assembly language and does not know in the end how big the functions will be, so it cannot simply assume both can reach a shared pool entry.

    x86 has variable-length instructions, so it can embed the adjusted offset directly in the instruction:

    00000000004000f0 <getA>:
      4000f0:   89 ff                   mov    %edi,%edi
      4000f2:   48 8b 04 fd 60 01 60    mov    0x600160(,%rdi,8),%rax
      4000f9:   00 
      4000fa:   c3                      retq   
      4000fb:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    
    0000000000400100 <getB>:
      400100:   89 ff                   mov    %edi,%edi
      400102:   48 8b 04 fd e0 01 60    mov    0x6001e0(,%rdi,8),%rax
      400109:   00 
      40010a:   c3 
    

    (and the size of long is allowed to vary from one compiler/target to another by definition)

    Going back to arm and gas we can try this:

    getB:
        ldr r3, .L6
        ldr r0, [r3, r0, lsl #2]
        bx  lr
    .L7:
        .align  2
    .L6:
        .word   .LANCHOR0+16
    

    giving

    00008000 <getA>:
        8000:   e59f3004    ldr r3, [pc, #4]    ; 800c <getA+0xc>
        8004:   e7930100    ldr r0, [r3, r0, lsl #2]
        8008:   e12fff1e    bx  lr
        800c:   00001000    .word   0x00001000
    
    00008010 <getB>:
        8010:   e59f3004    ldr r3, [pc, #4]    ; 801c <getB+0xc>
        8014:   e7930100    ldr r0, [r3, r0, lsl #2]
        8018:   e12fff1e    bx  lr
        801c:   00001010    .word   0x00001010
    

    So yes, the instruction could have been removed.

    This is simply a missed optimization. The reason may be deeper than you think, though: the compiler code may be written generically to do everything relative to the base address of the struct. This is useful if, for example, you want to generate position-independent code:

    getA:
        ldr r3, .L3
        ldr r2, .L3+4
    .LPIC0:
        add r3, pc, r3
        ldr r3, [r3, r2]
        ldr r0, [r3, r0, lsl #2]
        bx  lr
    .L4:
        .align  2
    .L3:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC0+8)
        .word   myStruct(GOT)
    
    getB:
        ldr r3, .L6
        ldr r2, .L6+4
    .LPIC1:
        add r3, pc, r3
        ldr r3, [r3, r2]
        add r0, r0, #16
        ldr r0, [r3, r0, lsl #2]
        bx  lr
    .L7:
        .align  2
    .L6:
        .word   _GLOBAL_OFFSET_TABLE_-(.LPIC1+8)
        .word   myStruct(GOT)
    

    So now you end up not only with pool values for each function in the binary, but you would also need GOT entries for each distinct struct offset the code generates.

    00008000 <getA>:
        8000:   e59f3010    ldr r3, [pc, #16]   ; 8018 <getA+0x18>
        8004:   e59f2010    ldr r2, [pc, #16]   ; 801c <getA+0x1c>
        8008:   e08f3003    add r3, pc, r3
        800c:   e7933002    ldr r3, [r3, r2]
        8010:   e7930100    ldr r0, [r3, r0, lsl #2]
        8014:   e12fff1e    bx  lr
        8018:   00010034    .word   0x00010034
        801c:   0000000c    .word   0x0000000c
    
    00008020 <getB>:
        8020:   e59f3014    ldr r3, [pc, #20]   ; 803c <getB+0x1c>
        8024:   e59f2014    ldr r2, [pc, #20]   ; 8040 <getB+0x20>
        8028:   e08f3003    add r3, pc, r3
        802c:   e7933002    ldr r3, [r3, r2]
        8030:   e2800010    add r0, r0, #16
        8034:   e7930100    ldr r0, [r3, r0, lsl #2]
        8038:   e12fff1e    bx  lr
        803c:   00010014    .word   0x00010014
        8040:   0000000c    .word   0x0000000c
    

    So there is no clean place to add that 16 at compile time: you would have to add another GOT entry that is an existing entry plus an offset. That gets very bad very quickly in terms of binary size.

    Looking at PIC with x86:

    0000000000400120 <getA>:
      400120:   48 8d 05 d9 0e c0 ff    lea    -0x3ff127(%rip),%rax        # 1000 <myStruct>
      400127:   89 ff                   mov    %edi,%edi
      400129:   48 8b 04 f8             mov    (%rax,%rdi,8),%rax
      40012d:   c3                      retq   
      40012e:   66 90                   xchg   %ax,%ax
    
    0000000000400130 <getB>:
      400130:   48 8d 05 c9 0e c0 ff    lea    -0x3ff137(%rip),%rax        # 1000 <myStruct>
      400137:   89 ff                   mov    %edi,%edi
      400139:   48 8b 84 f8 80 00 00    mov    0x80(%rax,%rdi,8),%rax
      400140:   00 
      400141:   c3                      retq 
    

    With a variable-length instruction set like this, you can see they did not need to fetch the GOT base address, fetch an offset, and add the two.

    Also remember that humans write these compilers, and all of the mentioned compilers are used by a very large number of people. Remember too that some instruction sets are, or were, more popular than others, and those drive most of the work on a tool. x86, for example, with its variable-length nature, can take generic icode-like pseudocode such as

    get struct base address
    add 16
    read from that address
    

    and roll that into one instruction rather than three or four. So how many people are actually complaining, specifically and in general? Think about how complicated and risky it would be to add some of these optimizations, versus the value added. For example, the bulk of this code may have been written by an x86 person, with the ARM person simply making the back end functional.

    Another thing to think about is that it is very easy to find things that appear to be missed optimizations in compiled code. Some folks have a fantasy that compilers are better than humans and that you never need to learn assembly language (to read it or to write it). While the need to write it, or cases where you might be tempted to take the compiled code and hand-optimize it, are rare, you also need to ask: is this really, REALLY affecting my performance? Or am I adding a foreign programming language to a project written in C just to optimize out one instruction, at a cost in readability and maintenance? Before taking that optimization path, you could ask the bigger question in this case: why make these functions at all, burning even more instructions and performance, rather than inlining the code manually or doing things to encourage the compiler to inline it?

    In any normal-sized project one would expect to find some "missed optimizations". And if you spend as much time staring at this stuff as I and others do, you will also see that gcc is getting worse at this over time, not better: the last handful of major releases have tended to produce bigger/worse code by some definition of worse (for ARM, not necessarily for other targets). Things like this happen often in compiler output in general; they are expected and rarely a surprise.

    You may also find that the people who work on the ARM backend for gcc and clang are the same people working at ARM or Keil on their tools, etc., and as a result you may see the same habits or limits.

    I think in this case, yes, this is a missed optimization; the linker is not involved here, and the compiler could have removed that instruction and folded the offset into the pool value. At the same time, considering what it takes to write a compiler, to add optimizations, and to retain some level of reliability (and in particular debuggability for the compiler authors), it may be best for the compiler to operate as much as possible relative to the base address of the structure. I will never know, but someone could very well have seen this, tried it, found that the PIC case exploded the size of the binary (or broke horribly), and just given up...

    Good catch, keep your eyes out for more, as these things happen quite often in compiled output.

    EDIT

    Also note that your test function is a bit too simple. If you were doing multiple accesses to the structure across the function, there is a tradeoff between a pool address per item and an extra instruction. In Thumb, the instruction can be cheaper than a pool entry for some offsets, or the same.

    With full-sized ARM instructions, we have the problem that b starts at the 16th long word in the structure, and since a long is 4 bytes here, the index passed in as an argument also gets multiplied by 4. So I tried this:

    typedef struct {
        unsigned char a[16];
        unsigned char b[16];
    } myStruct_t;
    
    myStruct_t myStruct;
    
    unsigned char getA(unsigned int index) 
    {    
        return myStruct.a[index];
    }
    
    unsigned char getB(unsigned int index) 
    {    
        return myStruct.b[index];
    }
    
    00000000 <getA>:
       0:   e59f3004    ldr r3, [pc, #4]    ; c <getA+0xc>
       4:   e7d30000    ldrb    r0, [r3, r0]
       8:   e12fff1e    bx  lr
       c:   00000000    .word   0x00000000
    
    00000010 <getB>:
      10:   e59f3008    ldr r3, [pc, #8]    ; 20 <getB+0x10>
      14:   e0833000    add r3, r3, r0
      18:   e5d30010    ldrb    r0, [r3, #16]
      1c:   e12fff1e    bx  lr
      20:   00000000    .word   0x00000000
    

    So here again there is no single instruction available in the chosen instruction set to do all of those operations. The +16 moved into the load, but to get there the base plus index had to be added first. But with

    unsigned char getB(unsigned int index) 
    {    
        return (myStruct.b[index]+myStruct.b[3]);
    }
    
    
    00000010 <getB>:
      10:   e59f3014    ldr r3, [pc, #20]   ; 2c <getB+0x1c>
      14:   e5d32013    ldrb    r2, [r3, #19]
      18:   e0833000    add r3, r3, r0
      1c:   e5d30010    ldrb    r0, [r3, #16]
      20:   e0800002    add r0, r0, r2
      24:   e20000ff    and r0, r0, #255    ; 0xff
      28:   e12fff1e    bx  lr
      2c:   00000000    .word   0x00000000
    

    the extra instruction is not there. And in this case:

    unsigned char getB(unsigned int index) 
    {    
        return (myStruct.b[index]+myStruct.b[index+1]);
    }
    
    00000010 <getB>:
      10:   e59f3014    ldr r3, [pc, #20]   ; 2c <getB+0x1c>
      14:   e0830000    add r0, r3, r0
      18:   e5d03011    ldrb    r3, [r0, #17]
      1c:   e5d00010    ldrb    r0, [r0, #16]
      20:   e0830000    add r0, r3, r0
      24:   e20000ff    and r0, r0, #255    ; 0xff
      28:   e12fff1e    bx  lr
      2c:   00000000    .word   0x00000000
    

    Had the 16 been folded in, I guess that still would have worked. But this case:

    unsigned long getB(unsigned int index) 
    {    
        return (myStruct.b[index]+myStruct.b[index+1]);
    }
    
    00000010 <getB>:
      10:   e59f3014    ldr r3, [pc, #20]   ; 2c <getB+0x1c>
      14:   e2802011    add r2, r0, #17
      18:   e2800010    add r0, r0, #16
      1c:   e7932102    ldr r2, [r3, r2, lsl #2]
      20:   e7930100    ldr r0, [r3, r0, lsl #2]
      24:   e0820000    add r0, r2, r0
      28:   e12fff1e    bx  lr
      2c:   00000000    .word   0x00000000
    

    This demonstrates multiple accesses to the struct: folding the offset into the pool value would have saved one instruction, but not two... you would need the index adjustments done differently, something like

    ldr r3, [pc, #20]
    add r2,r0,#1
    ldr r2, [r3, r2, lsl #2]
    ldr r0, [r3, r0, lsl #2]
    

    or something like that...

    This missed optimization, while real, is an even smaller corner case than it first appears, and one has to ask whether it is worth adding a lot of code and risk to the compiler for. Or should the generic mechanism just do its thing? When you have lots of accesses, or are generating PIC, or both, the generic solution just works. Some instruction sets have "richer" instructions, allowing for fewer, although larger, instructions.

    Using a struct in the first place instead of two independent arrays, and using these access functions, are not wise decisions for performance if a single instruction matters. Likewise, ask whether this difference is actually measurable in the final application, or whether it just looks distasteful. When you hit real performance issues bad enough to need this kind of treatment, it is good to start with more efficient high-level code, then take the output of the compiler (why do all of the work yourself) and hand-tune that in assembly language. And keep evaluating whether the difference is actually measurable and relevant, and whether it is worth the work and risk of carrying hand-tuned code. Most of the time the performance is tolerable: using a high-level language in the first place means you accept a performance hit, so your sensitivity and tolerance need to be managed.

    So the new bottom line is not much different:

    Yes, this is a missed optimization, but it is a corner case that is not worth the effort or risk to the compiler and its users, IMO. Some of the regulars on this tag and others will point the same thing out: small functions do not always optimize well, nor are they expected to.