c++arrays pointers assembly compiler-construction

From compiler perspective, how is reference for array dealt with, and, why passing by value(not decay) is not allowed?

As we know, in C++, we can pass an array's reference as an argument like f(int (&[N]). Yes, it is syntax guaranteed by the iso standard, but I am curious about how the compiler works here. I found this thread, but unfortunately, this doesn't answer my question -- How is this syntax implemented by the compiler?

I then wrote a demo and hoped to see something from the assembly language:

void foo_p(int*arr) {}
void foo_r(int(&arr)[3]) {}
template<int length>
void foo_t(int(&arr)[length]) {}
int main(int argc, char** argv)
{
    int arr[] = {1, 2, 3};
    foo_p(arr);
    foo_r(arr);
    foo_t(arr);
   return 0;
}

Originally, I guess it will still decay to the pointer, but will pass length implicitly via a register, then turn back into an array in the function body. But the assembly code tells me this is not true

void foo_t<3>(int (&) [3]):
  push rbp #4.31
  mov rbp, rsp #4.31
  sub rsp, 16 #4.31
  mov QWORD PTR [-16+rbp], rdi #4.31
  leave #4.32
  ret #4.32

foo_p(int*):
  push rbp #1.21
  mov rbp, rsp #1.21
  sub rsp, 16 #1.21
  mov QWORD PTR [-16+rbp], rdi #1.21
  leave #1.22
  ret #1.22

foo_r(int (&) [3]):
  push rbp #2.26
  mov rbp, rsp #2.26
  sub rsp, 16 #2.26
  mov QWORD PTR [-16+rbp], rdi #2.26
  leave #2.27
  ret #2.27

main:
  push rbp #6.1
  mov rbp, rsp #6.1
  sub rsp, 32 #6.1
  mov DWORD PTR [-16+rbp], edi #6.1
  mov QWORD PTR [-8+rbp], rsi #6.1
  lea rax, QWORD PTR [-32+rbp] #7.15
  mov DWORD PTR [rax], 1 #7.15
  lea rax, QWORD PTR [-32+rbp] #7.15
  add rax, 4 #7.15
  mov DWORD PTR [rax], 2 #7.15
  lea rax, QWORD PTR [-32+rbp] #7.15
  add rax, 8 #7.15
  mov DWORD PTR [rax], 3 #7.15
  lea rax, QWORD PTR [-32+rbp] #8.5
  mov rdi, rax #8.5
  call foo_p(int*) #8.5
  lea rax, QWORD PTR [-32+rbp] #9.5
  mov rdi, rax #9.5
  call foo_r(int (&) [3]) #9.5
  lea rax, QWORD PTR [-32+rbp] #10.5
  mov rdi, rax #10.5
  call void foo_t<3>(int (&) [3]) #10.5
  mov eax, 0 #11.11
  leave #11.11
  ret #11.11

live demo

I admit that I am not familiar with the assembly language, but clearly, the three function's assembly codes are the same! So, something must happen before the assembler codes. Anyway, unlike the array, the pointer knows nothing about the length, right?

Questions:

how does the compiler work here?
Now that the standard allows to pass an array by reference, does that mean it is trivial to implement? If so, why isn't passing by value allowed?

_{For Q2, my guess is for the complexity of the former C++ and C codes. After all, int[] being equal to int* in function parameters has been a tradition. Maybe one hundred years later, it will be deprecated?}

Solution

A C++ reference to an array is the same as a pointer to the first element, in assembly language.

Even C99 int foo(int arr[static 3]) is still just a pointer in asm. The static syntax guarantees to the compiler that it can safely read all 3 elements even if the C abstract machine doesn't access some elements, so for example it could use a branchless cmov for an if.

The caller doesn't pass a length in a register because it's a compile-time constant and thus not needed at run-time.

You can pass arrays by value, but only if they're inside a struct or union. In that case, different calling conventions have different rules. What kind of C11 data type is an array according to the AMD64 ABI.

You'd almost never want to pass an array by value, so it makes sense that C doesn't have syntax for it, and that C++ never invented any either. Passing by constant reference (i.e. const int *arr) is far more efficient; just a single pointer arg.

Removing compiler noise by enabling optimization:

I put your code on the Godbolt compiler explorer, compiled with gcc -O3 -fno-inline-functions -fno-inline-functions-called-once -fno-inline-small-functions to stop it from inlining the function calls. That gets rid of all the noise from -O0 debug-build and frame-pointer boilerplate. (I just searched the man page for inline and disabled inlining options until I got what I wanted.)

Instead of -fno-inline-small-functions and so on, you could use GNU C __attribute__((noinline)) on your function definitions to disable inlining for specific functions, even if they're static.

I also added a call to a function without a definition, so the compiler needs to have arr[] with the right values in memory, and added a store to arr[4] in two of the functions. This lets us test whether the compiler warns about going outside the array bounds.

__attribute__((noinline, noclone)) 
void foo_p(int*arr) {(void)arr;}
void foo_r(int(&arr)[3]) {arr[4] = 41;}

template<int length>
void foo_t(int(&arr)[length]) {arr[4] = 42;}

void usearg(int*); // stop main from optimizing away arr[] if foo_... inline

int main()
{
    int arr[] = {1, 2, 3};
    foo_p(arr);
    foo_r(arr);
    foo_t(arr);
    usearg(arr);
   return 0;
}

gcc7.3 -O3 -Wall -Wextra without function inlining, on Godbolt: Since I silenced the unused-args warnings from your code, the only warning we get is from the template, not from foo_r:

<source>: In function 'int main()':
<source>:14:10: warning: array subscript is above array bounds [-Warray-bounds]
     foo_t(arr);
     ~~~~~^~~~~

The asm output is:

void foo_t<3>(int (&) [3]) [clone .isra.0]:
    mov     DWORD PTR [rdi], 42       # *ISRA.3_4(D),
    ret
foo_p(int*):
    rep ret
foo_r(int (&) [3]):
    mov     DWORD PTR [rdi+16], 41    # *arr_2(D),
    ret

main:
    sub     rsp, 24             # reserve space for the array and align the stack for calls
    movabs  rax, 8589934593     # this is 0x200000001: the first 2 elems
    lea     rdi, [rsp+4]
    mov     QWORD PTR [rsp+4], rax    # MEM[(int *)&arr],  first 2 elements
    mov     DWORD PTR [rsp+12], 3     # MEM[(int *)&arr + 8B],  3rd element as an imm32
    call    foo_r(int (&) [3])
    lea     rdi, [rsp+20]
    call    void foo_t<3>(int (&) [3]) [clone .isra.0]    #
    lea     rdi, [rsp+4]      # tmp97,
    call    usearg(int*)     #
    xor     eax, eax  #
    add     rsp, 24   #,
    ret

The call to foo_p() still got optimized away, probably because it doesn't do anything. (I didn't disable inter-procedural optimization, and even the noinline and noclone attributes didn't stop that.) Adding *arr=0; to the function body results in a call to it from main (passing a pointer in rdi just like the other 2).

Notice the clone .isra.0 annotation on the demangled function name: gcc made a definition of the function that takes a pointer to arr[4] rather than to the base element. That's why there's a lea rdi, [rsp+20] to set up the arg, and why the store uses [rdi] to deref the point with no displacement. __attribute__((noclone)) would stop that.

This inter-procedural optimization is pretty much trivial and saves 1 byte of code size in this case (just the disp8 in the addressing mode in the clone), but can be useful in other cases. The caller needs to know that its a definition for a modified version of the function, like void foo_clone(int *p) { *p = 42; }, which is why it needs to encode that in the mangled symbol name.

If you'd instantiated the template in one file and called it from another file that couldn't see the definition, then without link-time optimization gcc would have to just call the regular name and pass a pointer to the array like the function as written.

IDK why gcc does this for the template but not the reference. It might be related to the fact it warns about the template version, but not the reference version. Or maybe it's related to main deducing the template?

BTW, an IPO that would actually make it run slightly faster would be to let main use mov rdi, rsp instead of lea rdi, [rsp+4]. i.e. take &arr[-1] as the function arg, so the clone would use mov dword ptr [rdi+20], 42.

But that's only helpful for callers like main that have allocated an array 4 bytes above rsp, and I think gcc is only looking for IPOs that make the function itself more efficient, not the calling sequence in one specific caller.