Search code examples
assemblygcccompiler-constructionx86-64

Why doesn't the generated GCC code for array allocation and access follow the `movl (%rdx, %rcx, 4), %eax` formula?


I am thinking about how to compile arrays in x86-64 assembly.

I am reading "Computer Systems - A Programmer's Perspective" and the authors give the following formula:

E[i] ->  `movl (%rdx, %rcx, 4i), %eax`

But I just used Compiler Explorer and GCC 12.2 to see the generated assembly for the following program:

int main() {
    int arr[] = {3,4,5};
    for (int i = 0; i < 3; i++) {
        arr[i];

    }
}

The generated code for this is:

main:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-16], 3
        mov     DWORD PTR [rbp-12], 4
        mov     DWORD PTR [rbp-8], 5
        mov     DWORD PTR [rbp-4], 0
        jmp     .L2
.L3:
        add     DWORD PTR [rbp-4], 1
.L2:
        cmp     DWORD PTR [rbp-4], 2
        jle     .L3
        mov     eax, 0
        pop     rbp
        ret

Here is a quote from a stackoverflow answer that explains what the registers RBP and RSP are doing:

rbp is the frame pointer on x86_64. In your generated code, it gets a snapshot of the stack pointer (rsp) so that when adjustments are made to rsp (i.e. reserving space for local variables or pushing values on to the stack), local variables and function parameters are still accessible from a constant offset from rbp.

Is RBP 16 here? For example rbp - 16 = 0, rbp - 12 = 4, rbp - 8 = 8 which kind of satisfies the formula.

Anyway, could you please give me an idea of how the formula from the textbook is related to the generated code from GCC?


Solution

  • arr[i]; as a C statement compiles to zero instructions, even in a debug build, since arr isn't volatile. All the asm you're seeing is just locals at fixed locations.


    If you're used to looking at AT&T syntax, use the output dropdown on Godbolt to uncheck "Intel Syntax".

    Also, you could get useful asm to look at by writing a function that takes a pointer arg and sums it for example, so you can enable light optimization without having the array optimize away to just returning a constant. You could use volatile to force it not to if you really want to see asm that stores array initializers to the stack instead of just taking a function arg. You're writing this to look at the asm, not run it, so don't write a main(). (In general see How to remove "noise" from GCC/clang assembly output?)

    int sumarr(int *arr){
        register int sum = 0; // register keyword is useful in GCC if you're going to disable optimization
        for (int i=0 ; i<1024 ; i++){
            sum += arr[i];
        }
        return sum;
    }
    

    Godbolt - gcc 12 with -Og uses an indexed addressing mode like you expect; most other optimization levels do a pointer increment.

    # gcc -Og
    sumarr:
            movl    $0, %eax
            movl    $0, %edx
            jmp     .L2                   # jump to the loop condition, redundant because 0 <= 1023 is known true; could just fall through into the loop, but -Og maybe intentionally preserves that execution?  It doesn't in general give consistent debugging.
    .L3:                                  # do {
            movslq  %eax, %rcx             # sign-extend int to intptr_t since -Og doesn't optimize much
            addl    (%rdi,%rcx,4), %edx    # edx += *(rdi + rcx*4) = array[i]
            addl    $1, %eax
    .L2:                                   # loop entry point for first iteration
            cmpl    $1023, %eax
            jle     .L3                   # }while(i<=1023)
            movl    %edx, %eax            # at low optimization levels, GCC didn't sum into the return-value register in the first place.
            ret
    

    With a normal level of optimization like -O2, we get a pointer increment:

    # GCC -O2 -fno-tree-vectorize        (-O2 by itself would vectorize with SIMD)
    sumarr:
            leaq    4096(%rdi), %rdx    # endp = ptr + len
            xorl    %eax, %eax          # sum = 0
    .L2:                                # do {
            addl    (%rdi), %eax
            addq    $4, %rdi
            cmpq    %rdx, %rdi
            jne     .L2                 # }while(p != endp)
            ret
    

    The amount of memory reserved for the local variables is always a multiple of 16 bytes, to keep the stack aligned to 16 bytes. in reality it should be 0 but it needs to be a multiple of 16?

    That's an over-simplification, but yes in this case 0*16 = 0 is a multiple of 16 which maintains stack alignment for RSP after push rbp. The actual space it uses is in the red zone, 128 bytes below RSP.

    The amount of stack space reserved for locals is always a multiple of 8, either odd or even depending on how many pushes it did, so the total movement of RSP is 16*n + 8.

    In this case where it doesn't need to preserve any call-preserved registers like RBX, it only pushed RBP (because -fno-omit-frame-pointer is the default at -O0.)

    It doesn't sub rsp, 16 because the x86-64 SysV ABI includes a red-zone; the 128 bytes below RSP are safe from signal handlers or anything stepping on them asynchronously, so can be used without "reserving". Use -mno-red-zone to stop GCC doing that, if you want to see it reserve space for an array.