I understand that I need to push the Link Register at the beginning of a function call, and pop that value to the Program Couter before returning, so that the execution can carry one from where it was before the function call.
What I don't understand is why most people do this by adding an extra register to the push/pop. For instance:
push {ip, lr}
...
pop {ip, pc}
For instance, here's a Hello World in ARM, provided by the official ARM blog:
.syntax unified
@ --------------------------------
.global main
main:
@ Stack the return address (lr) in addition to a dummy register (ip) to
@ keep the stack 8-byte aligned.
push {ip, lr}
@ Load the argument and perform the call. This is like 'printf("...")' in C.
ldr r0, =message
bl printf
@ Exit from 'main'. This is like 'return 0' in C.
mov r0, #0 @ Return 0.
@ Pop the dummy ip to reverse our alignment fix, and pop the original lr
@ value directly into pc — the Program Counter — to return.
pop {ip, pc}
@ --------------------------------
@ Data for the printf calls. The GNU assembler's ".asciz" directive
@ automatically adds a NULL character termination.
message:
.asciz "Hello, world.\n"
Question 1: what's the reason for the "dummy register" as they call it? Why not simply push{lr} and pop{pc}? They say it's to keep the stack 8-byte aligned, but ain't the stack 4-byte aligned?
Question 2: what register is "ip" (i.e., r7 or what?)
what's the reason for the "dummy register" as they call it? Why not simply push{lr} and pop{pc}? They say it's to keep the stack 8-byte aligned, but ain't the stack 4-byte aligned?
The stack only requires 4-byte alignment; but if the data bus is 64 bits wide (as it is on many modern ARMs), it's more efficient to keep it at an 8-byte alignment. Then, for example, if you call a function that needs to stack two registers, that can be done in a single 64-bit write rather than two 32-bit writes.
UPDATE: Apparently it's not just for efficiency; it's a requirement of the official procedure call standard, as noted in the comments.
If you're targetting older 32-bit ARMs, then the extra stacked register might degrade performance slightly.
what register is "ip" (i.e., r7 or what?)
r12
. See, for example, here for the full set of register aliases used by the procedure call standard.