I am wondering where the link register (LR) is used on ARM CPUs. As I understand it, it stores the return address of a function. But does every return address go into this register after a function call, or is it only relevant to leaf subroutines? How does it work in functions that have to use the stack (for storing data or additional return addresses): is LR still used there in any way?
BL instruction
Operation
if ConditionPassed(cond) then
LR = address of the instruction after the branch instruction
PC = PC + (SignExtend(signed_immed_24) << 2)
Usage
The BL instruction is used to perform a subroutine call. The return
from subroutine is achieved by copying the LR to the PC. Typically,
this is done by one of the following methods:
- Executing a BX R14 instruction.
- Executing a MOV PC,R14 instruction.
And newer ARMs go on to allow for pop {lr} and other...
Seems quite clear to me what the usage of LR is.
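To make the quoted operation concrete, here is a minimal sketch in C of what BL does to LR and PC. The names decode_bl and BlResult are made up for illustration, and it assumes a 32-bit ARM-state encoding, where the PC reads as the address of the instruction plus 8.

```c
#include <assert.h>
#include <stdint.h>

/* Result of executing an ARM-state BL located at a given address. */
typedef struct {
    uint32_t lr;      /* return address left in the link register */
    uint32_t target;  /* address the branch jumps to */
} BlResult;

/* Hypothetical helper: models the BL operation for a 32-bit ARM encoding. */
static BlResult decode_bl(uint32_t insn_addr, uint32_t insn)
{
    /* Bits 27..24 are 0b1011 for BL (0b1010 would be a plain B). */
    assert((insn & 0x0F000000u) == 0x0B000000u);

    /* SignExtend(signed_immed_24): bit 23 is the sign bit. */
    int32_t imm = (int32_t)(insn & 0x00FFFFFFu);
    if (imm & 0x00800000)
        imm -= 0x01000000;

    BlResult r;
    r.lr = insn_addr + 4u;  /* address of the instruction after the BL */
    /* In ARM state the PC reads as the instruction address plus 8.
       Multiply by 4 instead of shifting a possibly negative value. */
    r.target = insn_addr + 8u + (uint32_t)(imm * 4);
    return r;
}
```

With this sketch, decode_bl(0x100, 0xEB000002) gives lr = 0x104 and target = 0x110. The point is only that LR is written by every BL, whether or not the callee is a leaf.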
You can easily try it yourself as well:
unsigned int more_fun ( unsigned int );
unsigned int fun0 ( unsigned int x )
{
return(x+1);
}
unsigned int fun1 ( unsigned int x )
{
return(more_fun(x)+1);
}
unsigned int fun2 ( unsigned int x )
{
return(more_fun(x));
}
unsigned int fun3 ( unsigned int x )
{
return(3);
}
00000000 <fun0>:
0: e2800001 add r0, r0, #1
4: e12fff1e bx lr
00000008 <fun1>:
8: e92d4010 push {r4, lr}
c: ebfffffe bl 0 <more_fun>
10: e8bd4010 pop {r4, lr}
14: e2800001 add r0, r0, #1
18: e12fff1e bx lr
0000001c <fun2>:
1c: e92d4010 push {r4, lr}
20: ebfffffe bl 0 <more_fun>
24: e8bd4010 pop {r4, lr}
28: e12fff1e bx lr
0000002c <fun3>:
2c: e3a00003 mov r0, #3
30: e12fff1e bx lr
fun1 and fun2 push lr because, as documented, bl modifies the link register. To return from a non-leaf function you need to preserve the link register value, the return address, across that call, so you push it on the stack. The convention for this compiler wants the stack 64-bit aligned, so r4 is added simply to maintain that alignment; r4 is otherwise not involved here.
You can see that the leaf functions do not use the stack, because they have no reason to: the link register is not modified during the function, and in this case the functions are too simple to need the stack for other reasons. A leaf function that does need the stack still does not have to put lr on it, but if the compiler needs another register to keep the stack aligned, it is free to pick r14 as easily as any of the other registers.
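The reasoning above can be sketched as a toy model in C. This is not how real hardware is programmed, just an illustration: the Cpu struct, the helper names, and the addresses are all hypothetical. It shows what lr holds at the final bx lr of a non-leaf function, with and without saving it around the nested call.

```c
#include <stdint.h>

/* Toy machine state: one link register and a small stack. */
typedef struct {
    uint32_t lr;
    uint32_t stack[16];
    int      sp;        /* number of stack words in use */
} Cpu;

/* bl: the only effect modeled here is that lr receives the return address. */
static void bl(Cpu *c, uint32_t return_addr) { c->lr = return_addr; }
static void push_lr(Cpu *c) { c->stack[c->sp++] = c->lr; }
static void pop_lr(Cpu *c)  { c->lr = c->stack[--c->sp]; }

/* Non-leaf function that saves lr around its nested call.
   Returns the address its final "bx lr" would jump to. */
uint32_t nonleaf_with_save(uint32_t caller_ret)
{
    Cpu c = {0};
    bl(&c, caller_ret);  /* our caller calls us: lr = caller's return address */
    push_lr(&c);         /* push {lr} */
    bl(&c, 0x10u);       /* we call more_fun: lr now points back into us */
    pop_lr(&c);          /* pop {lr} restores the caller's return address */
    return c.lr;
}

/* Same function without the save: the nested bl has overwritten lr. */
uint32_t nonleaf_without_save(uint32_t caller_ret)
{
    Cpu c = {0};
    bl(&c, caller_ret);
    bl(&c, 0x10u);
    return c.lr;         /* 0x10, an address inside this function itself */
}
```

In this model nonleaf_with_save(0x8000) returns 0x8000, while nonleaf_without_save(0x8000) returns 0x10: the function would "return" into itself instead of to its caller.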
Now if we force something on the stack (non-leaf):
unsigned int new_fun ( unsigned int, unsigned int );
unsigned int fun4 ( unsigned int x, unsigned int y)
{
return(new_fun(x,y)+y);
}
00000034 <fun4>:
34: e92d4010 push {r4, lr}
38: e1a04001 mov r4, r1
3c: ebfffffe bl 0 <new_fun>
40: e0800004 add r0, r0, r4
44: e8bd4010 pop {r4, lr}
48: e12fff1e bx lr
lr has to be saved because a bl is used to call the next function. Per the convention, the compiler chose r4 to hold the y variable (in r1 coming in) so that it can be used after the nested call returns. Since only two registers need to be preserved, and that fits the stack alignment rule, r4 and lr are saved, and in this case both are actually used (r4 is not there just to align the stack).
Not sure what you mean by additional return addresses. Perhaps you are thinking that as each function makes a call there is a return address on the stack to preserve it. That is true, but you really only need to look at it one function at a time; that is the beauty of calling conventions. This architecture ideally uses bl to make function calls (as pointed out in another answer it does not have to, but it would be silly not to), so lr is modified on every call to a subroutine. As a result the calling function loses the return address to its own caller, so it needs to preserve it locally somehow. As we saw with fun4, a value can be kept in a callee-saved register across a call, so technically the compiler could, for example:
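fun2:
push {r4, r5}      @ r5 is callee-saved; r4 only keeps the stack aligned
mov r5,lr          @ keep the return address in r5 across the call
bl more_fun        @ overwrites lr; more_fun must preserve r5
mov r1,r5          @ move the return address to a scratch register
pop {r4, r5}       @ restore the caller's r4 and r5
bx r1              @ return to fun2's caller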
and not actually save lr on the stack. When building for newer ARMs than the one I am building for, you will see this:
00000008 <fun1>:
8: e92d4010 push {r4, lr}
c: ebfffffe bl 0 <more_fun>
10: e2800001 add r0, r0, #1
14: e8bd8010 pop {r4, pc}
00000018 <fun2>:
18: eafffffe b 0 <more_fun>
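The contents of lr are on the stack (lr itself is of course a register, so it cannot be "on the stack"), but on architectures after ARMv4T you can pop to the pc and change modes between ARM and Thumb, where before only bx could be used for Thumb interworking. So the saved return address is popped straight into pc and no separate bx lr is needed.
Also note the tail-call optimization for fun2: it branches with b rather than bl, so lr is left untouched and more_fun returns directly to fun2's caller. This means that fun2 did not even push the return address on the stack.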
It is pretty obvious how lr is used if you look at the ARM docs. Then think about how a compiler would implement a standard function, and what optimizations it might do. And of course you can then just try it and see what certain compilers actually generate.