why is there a b (branch) instruction just before the end of a procedure in many arm64 procedures?

This is from linux source arch/arm64/kernel/head.S showing the kernel start. The code first calls preserve_boot_args and next calls el2_setup using bl(branch and link). I showed the procedure preserve_boot_args also.

SYM_CODE_START(primary_entry)
        bl      preserve_boot_args
        bl      el2_setup                       // Drop to EL1, w0=cpu_boot_mode
        adrp    x23, __PHYS_OFFSET
        and     x23, x23, MIN_KIMG_ALIGN - 1    // KASLR offset, defaults to 0
        bl      set_cpu_boot_mode_flag
        bl      __create_page_tables
        /*
         * The following calls CPU setup code, see arch/arm64/mm/proc.S for
         * details.
         * On return, the CPU will be ready for the MMU to be turned on and
         * the TCR will have been set.
         */
        bl      __cpu_setup                     // initialise processor
        b       __primary_switch
SYM_CODE_END(primary_entry)

SYM_CODE_START_LOCAL(preserve_boot_args)
        mov     x21, x0                         // x21=FDT

        adr_l   x0, boot_args                   // record the contents of
        stp     x21, x1, [x0]                   // x0 .. x3 at kernel entry
        stp     x2, x3, [x0, #16]

        dmb     sy                              // needed before dc ivac with
                                                // MMU off

        mov     x1, #0x20                       // 4 x 8 bytes
        b       __inval_dcache_area             // tail call
SYM_CODE_END(preserve_boot_args)

As I understand, bl is for calling procedure (after procedure, return to the address kept in the lr - link register, x30) and b is just going to the labeled address not returning. But in procedure preserve_boot_args above, just at the end, there is b __inval_dcache_area instruction which just goes to __inval_dcache_area without returning. Then how does it return to the original code (where bl el2_setup is)? And how does a procedure ends itself? The definition of SYM_CODE_END is like this :

#define SYM_END(name, sym_type)                         \
        .type name sym_type ASM_NL                      \
        .size name, .-name
#endif

I can't understand how this code makes it return to the address in lr. Shouldn't we do something like mv pc, lr?

Solution

This looks like a call optimization — sometimes called tail call optimization, which is useful in reducing the stack depth on recursion — but also useful in the general case.

The way this optimization works, a caller, A, calls a function, B, which calls another function, C. If B was going to return directly to A after calling C, then B can jump to C instead! Being none the wiser, C returns to its caller, which appears to be A. By doing this, B doesn't need a stack frame and doesn't have to save the link register — it simply passes its return address on to C.

This optimization skips the otherwise normal return of C to B, making C return directly to A. This transformation is only enabled (i.e. correct) under certain circumstances:

If there is no work for B to do upon C's return, B can setup C to return directly to A.
From a logical perspective (e.g. in C or pseudo code), this means that either:
- B and C are both void functions, or,
- B ignores C's return value, or,
- B returns C's return value to A, unmodified.
B also cannot cleanup a stack frame after C's return, since C returns directly to A; if B has a stack frame, it would have to be released before calling C. (And also see @PeterCordes comment below.)

From a hardware perspective, when the optimization is used (it is coded in B and then B is invoked), it is as if B & C were merged: Function A calls "BC" if you will. Dynamically, there is one bl (A->BC) and one ret (BC->A) — nicely balanced, which is good for the hardware branch predictor's call-stack handling.

We cannot express the tail call optimization in most high level languages, as most have only "call subroutine" and do not have a "jump to subroutine" feature. So, at best, we can write code that does no work upon return as per the above, and let the language/compiler perform the optimization, if it knows the optimization.

In A calls B calls C, B & C are functions, but A may or may not be a function (it could just some assembly code — while it is a caller of B, A itself doesn't need to be invoked or invokable as a function. While the call chain can be deep, the first code at the very top a call chain is not a function (e.g. it is _start or sometimes main) and has no where to return to (so doesn't use ret to exit; it doesn't have a return address parameter provided by a caller). (If code has a place to return, i.e. a return address to use, then by definition it is not the top of the call chain (it is nominally a function).)

This initial code can play the role of A but not of B or C in the pattern. Tail call is precluded for A's call to B when A is not a function because there is no caller of A for B to return to. This is why the pattern must be A calls B calls C, B & C must be functions, and we consider applying the optimization to B. If A is a function, it must have a caller, so then can play the role of the middle function in the pattern (as can C, e.g. if C calls D).