As we know, in C++ we can pass a reference to an array as a function argument, e.g. f(int (&arr)[N]). Yes, this syntax is guaranteed by the ISO standard, but I am curious about how the compiler handles it. I found this thread, but unfortunately it doesn't answer my question: how is this syntax implemented by the compiler?
I then wrote a demo, hoping to learn something from the assembly output:
void foo_p(int*arr) {}
void foo_r(int(&arr)[3]) {}
template<int length>
void foo_t(int(&arr)[length]) {}
int main(int argc, char** argv)
{
    int arr[] = {1, 2, 3};
    foo_p(arr);
    foo_r(arr);
    foo_t(arr);
    return 0;
}
Originally, I guessed that the array would still decay to a pointer, with the length passed implicitly in a register and then turned back into an array in the function body. But the assembly code tells me this is not true:
void foo_t<3>(int (&) [3]):
push rbp #4.31
mov rbp, rsp #4.31
sub rsp, 16 #4.31
mov QWORD PTR [-16+rbp], rdi #4.31
leave #4.32
ret #4.32
foo_p(int*):
push rbp #1.21
mov rbp, rsp #1.21
sub rsp, 16 #1.21
mov QWORD PTR [-16+rbp], rdi #1.21
leave #1.22
ret #1.22
foo_r(int (&) [3]):
push rbp #2.26
mov rbp, rsp #2.26
sub rsp, 16 #2.26
mov QWORD PTR [-16+rbp], rdi #2.26
leave #2.27
ret #2.27
main:
push rbp #6.1
mov rbp, rsp #6.1
sub rsp, 32 #6.1
mov DWORD PTR [-16+rbp], edi #6.1
mov QWORD PTR [-8+rbp], rsi #6.1
lea rax, QWORD PTR [-32+rbp] #7.15
mov DWORD PTR [rax], 1 #7.15
lea rax, QWORD PTR [-32+rbp] #7.15
add rax, 4 #7.15
mov DWORD PTR [rax], 2 #7.15
lea rax, QWORD PTR [-32+rbp] #7.15
add rax, 8 #7.15
mov DWORD PTR [rax], 3 #7.15
lea rax, QWORD PTR [-32+rbp] #8.5
mov rdi, rax #8.5
call foo_p(int*) #8.5
lea rax, QWORD PTR [-32+rbp] #9.5
mov rdi, rax #9.5
call foo_r(int (&) [3]) #9.5
lea rax, QWORD PTR [-32+rbp] #10.5
mov rdi, rax #10.5
call void foo_t<3>(int (&) [3]) #10.5
mov eax, 0 #11.11
leave #11.11
ret #11.11
I admit that I am not familiar with assembly language, but clearly the assembly for the three functions is the same! So whatever distinguishes them must happen before code generation. Anyway, unlike the array, the pointer knows nothing about the length, right?
For Q2, my guess is that it's for compatibility with older C++ and C code. After all, int[] being equivalent to int* in function parameters is a long-standing tradition. Maybe one hundred years from now it will be deprecated?
Even C99 int foo(int arr[static 3]) is still just a pointer in asm. The static syntax guarantees to the compiler that it can safely read all 3 elements even if the C abstract machine doesn't access some of them, so for example it could use a branchless cmov for an if.
The caller doesn't pass a length in a register because it's a compile-time constant and thus not needed at run-time.
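To make the compile-time-constant point concrete, here is a minimal sketch. The helper names length_of, bytes_via_reference and bytes_via_pointer are made up for illustration and are not part of the original code. The length lives only in the type, so the reference versions can recover it without any run-time argument, while the pointer version sees nothing but a pointer.

```cpp
#include <cstddef>

// Hypothetical helpers, for illustration only.

// N is deduced from the argument's type at compile time;
// no length is passed at run time.
template <std::size_t N>
constexpr std::size_t length_of(int (&)[N]) { return N; }

// sizeof on a reference-to-array yields the size of the whole array.
inline std::size_t bytes_via_reference(int (&arr)[3]) { return sizeof(arr); }

// sizeof on a pointer parameter yields only the size of the pointer.
inline std::size_t bytes_via_pointer(int* arr) { return sizeof(arr); }
```

For an int arr[3], length_of(arr) is 3 and bytes_via_reference(arr) is 3 * sizeof(int), whereas bytes_via_pointer(arr) is just sizeof(int*).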
You can pass arrays by value, but only if they're inside a struct or union. In that case, different calling conventions have different rules; see the question "What kind of C11 data type is an array according to the AMD64 ABI".
You'd almost never want to pass an array by value, so it makes sense that C doesn't have syntax for it, and that C++ never invented any either. Passing by constant reference (i.e. const int *arr) is far more efficient; just a single pointer arg.
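As a rough illustration of the difference, the sketch below wraps the array in a struct to get by-value semantics and compares it with the pointer-to-const form. The names Wrapped3, sum_then_clear and sum_in_place are hypothetical, chosen only for this example.

```cpp
// Hypothetical wrapper type: the array is a member, so the struct is copied by value.
struct Wrapped3 {
    int elems[3];
};

// Receives its own copy; changes to w are not visible to the caller.
inline int sum_then_clear(Wrapped3 w) {
    int total = 0;
    for (int& e : w.elems) {
        total += e;
        e = 0;  // modifies only the local copy
    }
    return total;
}

// Receives only the address (plus a length in this sketch); nothing is copied.
inline int sum_in_place(const int* arr, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i) total += arr[i];
    return total;
}
```

After calling sum_then_clear(w), the caller's w.elems is unchanged, which is exactly what by-value passing means; how the copy is actually passed depends on the calling convention.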
I put your code on the Godbolt compiler explorer, compiled with gcc -O3 -fno-inline-functions -fno-inline-functions-called-once -fno-inline-small-functions to stop it from inlining the function calls. That gets rid of all the noise from the -O0 debug build and frame-pointer boilerplate. (I just searched the man page for inline and disabled inlining options until I got what I wanted.)
Instead of -fno-inline-small-functions and so on, you could use GNU C __attribute__((noinline)) on your function definitions to disable inlining for specific functions, even if they're static.
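A minimal sketch of the per-function approach, assuming a GNU-compatible compiler. The macro name DEMO_NOINLINE and the function store_first are made up for this example; on other compilers the macro expands to nothing.

```cpp
// GNU C extension; the macro name DEMO_NOINLINE is made up for this sketch.
#if defined(__GNUC__)
#  define DEMO_NOINLINE __attribute__((noinline))
#else
#  define DEMO_NOINLINE  // other compilers: no-op in this sketch
#endif

// static (internal linkage), yet the compiler is asked to keep it out of line.
// The store gives the function a visible side effect, so the call is not
// trivially removable the way a call to an empty function would be.
DEMO_NOINLINE static void store_first(int (&arr)[3]) {
    arr[0] = 42;
}
```

Note that noinline only blocks inlining; gcc may still apply other inter-procedural optimizations, such as creating clones, unless those are disabled separately.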
I also added a call to a function without a definition, so the compiler needs to have arr[] with the right values in memory, and added a store to arr[4] in two of the functions. This lets us test whether the compiler warns about going outside the array bounds.
__attribute__((noinline, noclone))
void foo_p(int*arr) {(void)arr;}
void foo_r(int(&arr)[3]) {arr[4] = 41;}
template<int length>
void foo_t(int(&arr)[length]) {arr[4] = 42;}
void usearg(int*); // stop main from optimizing away arr[] if foo_... inline
int main()
{
    int arr[] = {1, 2, 3};
    foo_p(arr);
    foo_r(arr);
    foo_t(arr);
    usearg(arr);
    return 0;
}
gcc7.3 -O3 -Wall -Wextra without function inlining, on Godbolt: since I silenced the unused-args warnings from your code, the only warning we get is from the template, not from foo_r:
<source>: In function 'int main()':
<source>:14:10: warning: array subscript is above array bounds [-Warray-bounds]
foo_t(arr);
~~~~~^~~~~
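Since the array length is part of the type, a constant index can also be checked at compile time rather than left to -Warray-bounds. The following is only a sketch; checked_at is a hypothetical helper and not something from the original code.

```cpp
#include <cstddef>

// Hypothetical helper: both the index and the array length are compile-time
// constants, so an out-of-range index is rejected before any code is generated.
template <std::size_t Index, std::size_t N>
int& checked_at(int (&arr)[N]) {
    static_assert(Index < N, "constant index is outside the array bounds");
    return arr[Index];
}
```

With int arr[3], checked_at<2>(arr) = 42; compiles, while checked_at<4>(arr) fails with the static_assert message instead of producing a store past the end of the array.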
The asm output is:
void foo_t<3>(int (&) [3]) [clone .isra.0]:
mov DWORD PTR [rdi], 42 # *ISRA.3_4(D),
ret
foo_p(int*):
rep ret
foo_r(int (&) [3]):
mov DWORD PTR [rdi+16], 41 # *arr_2(D),
ret
main:
sub rsp, 24 # reserve space for the array and align the stack for calls
movabs rax, 8589934593 # this is 0x200000001: the first 2 elems
lea rdi, [rsp+4]
mov QWORD PTR [rsp+4], rax # MEM[(int *)&arr], first 2 elements
mov DWORD PTR [rsp+12], 3 # MEM[(int *)&arr + 8B], 3rd element as an imm32
call foo_r(int (&) [3])
lea rdi, [rsp+20]
call void foo_t<3>(int (&) [3]) [clone .isra.0] #
lea rdi, [rsp+4] # tmp97,
call usearg(int*) #
xor eax, eax #
add rsp, 24 #,
ret
The call to foo_p() still got optimized away, probably because it doesn't do anything. (I didn't disable inter-procedural optimization, and even the noinline and noclone attributes didn't stop that.) Adding *arr=0; to the function body results in a call to it from main (passing a pointer in rdi just like the other 2).
Notice the clone .isra.0 annotation on the demangled function name: gcc made a definition of the function that takes a pointer to arr[4] rather than to the base element. That's why there's a lea rdi, [rsp+20] to set up the arg, and why the store uses [rdi] to deref the pointer with no displacement. __attribute__((noclone)) would stop that.
This inter-procedural optimization is pretty much trivial and saves 1 byte of code size in this case (just the disp8 in the addressing mode in the clone), but can be useful in other cases. The caller needs to know that it's calling a definition for a modified version of the function, like void foo_clone(int *p) { *p = 42; }, which is why gcc needs to encode that in the mangled symbol name.
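The transformation can be mimicked by hand. This is a hand-written analogue, not gcc's actual output, and it uses an 8-element array so that index 4 is in bounds (the demo above stores out of bounds on purpose). All function names here are hypothetical.

```cpp
// Function as written: takes the whole array and stores to element 4.
inline void store_as_written(int (&arr)[8]) { arr[4] = 42; }

// Hand-written analogue of the clone: it expects a pointer that already
// points at the target element, so the store needs no displacement.
inline void store_clone(int* p) { *p = 42; }

// Original calling sequence: pass the address of the array itself.
inline void call_as_written(int (&arr)[8]) { store_as_written(arr); }

// Modified calling sequence: the caller does the address arithmetic.
inline void call_clone(int (&arr)[8]) { store_clone(&arr[4]); }
```

Both calling sequences leave the array in the same state; the offset computation has simply moved from the callee into the caller, which is why the caller and the clone have to agree on the modified interface.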
If you'd instantiated the template in one file and called it from another file that couldn't see the definition, then without link-time optimization gcc would have to just call the regular name and pass a pointer to the array like the function as written.
IDK why gcc does this for the template but not the reference. It might be related to the fact it warns about the template version, but not the reference version. Or maybe it's related to main deducing the template?
BTW, an IPO that would actually make it run slightly faster would be to let main use mov rdi, rsp instead of lea rdi, [rsp+4], i.e. take &arr[-1] as the function arg, so the clone would use mov dword ptr [rdi+20], 42.
But that's only helpful for callers like main that have allocated an array 4 bytes above rsp, and I think gcc is only looking for IPOs that make the function itself more efficient, not the calling sequence in one specific caller.