ARM Link register - non-leaf subroutine

I am wondering where the link register is used in an ARM CPU. As I understand it, it stores the return address of a function. But does every return address go into this register after a function call, or is it only relevant to leaf subroutines? How does this work in functions that have to use the stack (for storing data or additional return addresses)? Is LR still used there in any way?

Solution

BL instruction

Operation
  if ConditionPassed(cond) then
      LR = address of the instruction after the branch instruction
      PC = PC + (SignExtend(signed_immed_24) << 2)

Usage
  The BL instruction is used to perform a subroutine call. The return
  from subroutine is achieved by copying the LR to the PC. Typically, 
  this is done by one of the following methods:
  - Executing a BX R14 instruction.
  - Executing a MOV PC,R14 instruction.

Newer ARMs go on to allow returning by popping the saved return address straight into the PC (pop {pc}), among other methods.

Seems quite clear to me what the usage of LR is.
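The Operation pseudocode quoted above can be sketched in C as follows. This is only an illustration, not anything taken from the ARM documentation: decode_bl, is_bl and BlResult are names I made up, the condition check is skipped, and it assumes ARM (not Thumb) state, where reading PC gives the address of the current instruction plus 8.

```c
#include <stdint.h>

/* Hypothetical result type: what BL writes to LR and to PC. */
typedef struct {
    uint32_t lr;      /* return address placed in LR (R14) */
    uint32_t target;  /* new PC, i.e. the branch destination */
} BlResult;

/* True if bits 27..24 are 1011, the ARM-state BL opcode pattern. */
static int is_bl(uint32_t insn)
{
    return (insn & 0x0F000000u) == 0x0B000000u;
}

/* Apply the BL operation to an instruction word located at insn_addr.
 * Assumption: ARM state, so PC reads as insn_addr + 8. The condition
 * field is ignored, as if ConditionPassed(cond) were true. */
static BlResult decode_bl(uint32_t insn, uint32_t insn_addr)
{
    uint32_t imm24 = insn & 0x00FFFFFFu;
    /* Sign-extend 24 to 32 bits: if bit 23 is set, fill the top byte with ones. */
    uint32_t offset = (imm24 & 0x00800000u) ? (imm24 | 0xFF000000u) : imm24;
    BlResult r;
    r.lr = insn_addr + 4u;                     /* instruction after the BL */
    r.target = insn_addr + 8u + (offset << 2); /* unsigned math wraps mod 2^32 */
    return r;
}
```

For example, decode_bl(0xeb000000, 0x100) gives lr = 0x104 and target = 0x108. The word 0xebfffffe that shows up in the listings below decodes to a target equal to the instruction's own address, presumably a placeholder that the linker patches later, since those listings come from an unlinked object file.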

You can easily try it yourself as well:

unsigned int more_fun ( unsigned int );
unsigned int fun0 ( unsigned int x )
{
    return(x+1);
}
unsigned int fun1 ( unsigned int x )
{
    return(more_fun(x)+1);
}
unsigned int fun2 ( unsigned int x )
{
    return(more_fun(x));
}
unsigned int fun3 ( unsigned int x )
{
    return(3);
}

00000000 <fun0>:
   0:   e2800001    add r0, r0, #1
   4:   e12fff1e    bx  lr

00000008 <fun1>:
   8:   e92d4010    push    {r4, lr}
   c:   ebfffffe    bl  0 <more_fun>
  10:   e8bd4010    pop {r4, lr}
  14:   e2800001    add r0, r0, #1
  18:   e12fff1e    bx  lr

0000001c <fun2>:
  1c:   e92d4010    push    {r4, lr}
  20:   ebfffffe    bl  0 <more_fun>
  24:   e8bd4010    pop {r4, lr}
  28:   e12fff1e    bx  lr

0000002c <fun3>:
  2c:   e3a00003    mov r0, #3
  30:   e12fff1e    bx  lr

fun1 and fun2 push lr because, as documented, bl modifies the link register. To return from a non-leaf function you need to preserve the link register value you were called with, which is your return address, so you push it on the stack. The calling convention this compiler follows wants the stack 64-bit aligned, so r4 is pushed as well simply to maintain that alignment; r4 is otherwise not involved here.
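To make the clobbering concrete, here is a toy model in C. It is not an ARM emulator: it tracks only LR and a small descending stack, and ToyCpu, toy_bl and the other names, as well as the addresses used, are invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { TOY_STACK_WORDS = 16 };

/* Toy state: just LR and a small full-descending stack. */
typedef struct {
    uint32_t lr;
    uint32_t stack[TOY_STACK_WORDS];
    int sp;  /* index of the top entry; TOY_STACK_WORDS means empty */
} ToyCpu;

static void toy_init(ToyCpu *c) { c->lr = 0; c->sp = TOY_STACK_WORDS; }

/* bl: LR gets the address of the instruction after the branch. */
static void toy_bl(ToyCpu *c, uint32_t bl_addr) { c->lr = bl_addr + 4u; }

static void toy_push(ToyCpu *c, uint32_t v)
{
    assert(c->sp > 0);
    c->stack[--c->sp] = v;
}

static uint32_t toy_pop(ToyCpu *c)
{
    assert(c->sp < TOY_STACK_WORDS);
    return c->stack[c->sp++];
}

/* Non-leaf function that does not save lr. Returns where its final
 * "bx lr" would jump to. */
static uint32_t return_without_save(ToyCpu *c, uint32_t caller_bl, uint32_t inner_bl)
{
    toy_bl(c, caller_bl);  /* our caller calls us */
    toy_bl(c, inner_bl);   /* we call more_fun: lr is overwritten */
    return c->lr;          /* bx lr lands back inside ourselves */
}

/* Same function, but with push {lr} before and pop {lr} after the call. */
static uint32_t return_with_save(ToyCpu *c, uint32_t caller_bl, uint32_t inner_bl)
{
    toy_bl(c, caller_bl);
    toy_push(c, c->lr);    /* preserve our return address */
    toy_bl(c, inner_bl);
    c->lr = toy_pop(c);    /* restore it */
    return c->lr;          /* bx lr goes back to the caller */
}
```

With a caller whose bl sits at 0x100 and an inner bl at 0xc, the version without the save "returns" to 0x10, the instruction after its own bl, while the version with the save returns to 0x104 as intended.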

You can see that the leaf functions do not use the stack because they have no reason to: the link register is not modified during the function, and in this case the functions are too simple to need the stack for other reasons. A leaf function that does need the stack does not have to put lr on it, but if the compiler needs one more register to keep the stack aligned, it is free to use r14 as well as any of the other registers.

Now if we force something else onto the stack (still a non-leaf function):

unsigned int new_fun ( unsigned int, unsigned int );
unsigned int fun4 ( unsigned int x, unsigned int y)
{
    return(new_fun(x,y)+y);
}

00000034 <fun4>:
  34:   e92d4010    push    {r4, lr}
  38:   e1a04001    mov r4, r1
  3c:   ebfffffe    bl  0 <new_fun>
  40:   e0800004    add r0, r0, r4
  44:   e8bd4010    pop {r4, lr}
  48:   e12fff1e    bx  lr

lr has to be saved on the stack because a bl is used to call the next function. In this case, per the calling convention, the compiler chose r4 to hold the y variable (which comes in as r1) so that it can be used after the nested call returns. Only two registers need to be preserved, which already satisfies the stack alignment rule, so r4 and lr are saved and this time both are actually used (r4 is not there just to align the stack).
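The alignment arithmetic behind both cases can be sketched in C. This assumes 4-byte registers and the 8-byte (64-bit) stack alignment mentioned above; regs_to_push is a made-up helper, not something the compiler exposes.

```c
enum { WORD_BYTES = 4, STACK_ALIGN_BYTES = 8 };

/* Given how many registers a function must save, return how many it
 * ends up pushing so the pushed size is a multiple of 8 bytes.
 * Assumption: each register is 4 bytes wide. */
static unsigned regs_to_push(unsigned regs_needed)
{
    unsigned bytes = regs_needed * WORD_BYTES;
    /* Round up to the next multiple of the alignment. */
    unsigned padded = (bytes + STACK_ALIGN_BYTES - 1u)
                      / STACK_ALIGN_BYTES * STACK_ALIGN_BYTES;
    return padded / WORD_BYTES;
}
```

fun1 and fun2 only need lr, so regs_to_push(1) is 2 and r4 comes along as padding. fun4 needs r4 and lr, so regs_to_push(2) is 2 and no padding is required.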

Not sure what you mean by additional return addresses. Perhaps you are thinking that as each function makes a call there is a return address on the stack to preserve, and that is true, but you really only need to look at it one function at a time; that is the beauty of calling conventions. On this architecture function calls are ideally made with bl (as pointed out in another answer they don't have to be, but it would be silly not to), which means lr is modified for every call to a subroutine. As a result the calling function loses its return address to its own caller, so it needs to preserve it locally somehow. Just as fun4 kept y in r4, the compiler could technically do something like this, for example:

fun2:
 push {r4, r5}
 mov r5,lr
 bl 0 <more_fun>
 mov r1,r5
 pop {r4, r5}
 bx r1

That keeps the return address in r5 and never actually saves lr on the stack. If you build for a newer ARM than the one I targeted above, you will see this:

00000008 <fun1>:
   8:   e92d4010    push    {r4, lr}
   c:   ebfffffe    bl  0 <more_fun>
  10:   e2800001    add r0, r0, #1
  14:   e8bd8010    pop {r4, pc}

00000018 <fun2>:
  18:   eafffffe    b   0 <more_fun>

The contents of lr are on the stack (lr itself is of course a register, so it cannot be "on the stack"). After ARMv4T you can pop to the pc and change modes between ARM and Thumb, where before only bx could be used for Thumb interworking.

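Also note the tail call optimization for fun2: it simply branches to more_fun with b, leaving lr untouched, so more_fun returns straight to fun2's caller. This means that fun2 did not even push the return address on the stack.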
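At the C level the difference between fun1 and fun2 is whether any work remains after the call. A minimal sketch, where the body of more_fun is hypothetical (the original only declares it), and where a compiler that can see that body may well inline it instead:

```c
/* Hypothetical body so the example links; the original only has a prototype. */
unsigned int more_fun(unsigned int x)
{
    return x * 2u;
}

/* Not a tail call: the +1 still has to run after more_fun returns,
 * so fun1 needs its own return address preserved across the call. */
unsigned int fun1(unsigned int x)
{
    return more_fun(x) + 1u;
}

/* Tail call candidate: nothing happens after more_fun returns, so the
 * compiler may replace "bl more_fun; return" with a plain "b more_fun". */
unsigned int fun2(unsigned int x)
{
    return more_fun(x);
}
```

Whether the branch is actually emitted depends on the compiler, target and optimization level, so check the disassembly rather than assume it.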

How lr is used is pretty obvious if you look at the ARM docs. Then think about how a compiler would implement a standard function, and what optimizations it might do. And of course you can then just try it and see what particular compilers actually generate.