I have a recursive C function
foo (int numFruits) {
....
// recurse at some point
}
inside the main function.
The corresponding assembly would look like this:
.pos 0x500
main:
%r10 // contains numFruits
call foo
halt
.pos 0x4000
foo: // recursive
irmovq $8, %r13 // load value 8 into %r13
...
Inside foo, I am using constant value for size of quad which is 8 bytes long. (value 8 is not present in C code, but I am using this value to turn length of array to corresponding address, etc...)
If I load this value every time when foo is called recursively, I think it's wasting cycles. I was wondering whether compiler is able to optimize it such that constants are loaded before calling foo in the main?
Example: loading value 8 into r13 once before calling foo, so that this doesn't have to be loaded every single time. (provided that r13 is restored to its original state before loading value 8, after hitting halt)
If I were to save value 8 into r13 before main, would this be still preserving the spirit of foo(int numFruits) or is my change equivalent to foo(int numFruits, int quadSize)?
Thank you
It's equivalent to foo(int numFruits, long quadSize)
. Well maybe int quadSize
if your y86 ABI has 64-bit int
. All of the normal x86-64 ABIs have 32-bit int
, and Windows x64 even has 32-bit long
.
You also tagged this x86. x86-64 can move 8
into a 64-bit register with a 5-byte instruction like mov $8, %r13d
: 1 opcode byte + imm32. (Actually 6 bytes including the REX prefix). You only need mov r64, imm64
for constants that don't fit in a zero or sign-extended 32-bit immediate. Writing a 32-bit register zero-extends in to the full 64-bit register. You can even code-golf constant setup even more, at the cost of speed. Like push $imm8
/ pop %r13
in 3 bytes (actually 4 for a REX prefix). When optimizing for code size, you want to avoid r8..r15. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/132985#132985.
I have no idea if y86 has efficient machine-code encodings for small constants.
There are no physical y86 CPUs in existence that I know of. There are emulators, but I'm not sure if there are even any virtual designs (like verilog) of y86 hardware that could be simulated in a cycle-accurate simulator.
So any talk of "saving cycles" is a stretch for y86. Real x86-64 CPUs are pipelined superscalar with out-of-order-execution, and often not bottlenecked on code-fetch. Especially in modern CPUs with a uop cache. Depending on the loop, extra mov-immediate instructions off the critical path might not slow things down any. https://agner.org/optimize/, and see performance links in the x86 tag wiki.
But yes, you should generally hoist constant-setup out of loops.
If your "loop" is recursion that you can't easily optimize into a normal loop without an expensive call
/ ret
, you can certainly make a wrapper function for public usage, and have it fall through into a private function that effectively uses a custom calling convention (which assumes that %r13 = 8
).
.globl foo
foo:
irmovq $8, %r13
# .p2align 4 # optional 16-byte alignment for the recursion entry point
# fall through
.Lprivate_foo:
# only reachable with r13=8
# blah blah using r13=8
call .Lprivate_foo
# blah blah still assuming r13=8
call .Lprivate_foo
# more stuff
ret # the final return
Nothing else can call private_foo
; it's a local label (.Lxxx
) only visible from this source. So the body of .Lprivate_foo
can assume that R13 = 8.
If r13
is a call-preserved register in your y86 calling convention (like it is in x86-64 System V), then either pick a call-clobbered register like r11, or have the public wrapper function call private_foo
so it can restore the caller's r13
before returning. Using a register that functions are normally allowed to clobber makes this near-zero extra overhead instead of introducing an extra level of call/ret.
But that only works if you don't call any other unknown functions from inside your recursive function, otherwise you have to assume they clobbered R11.
Optimizing recursion into a loop has a big advantage, and compilers do that whenever possible. (In a double-recursive function like a tree traversal, they will often turn the 2nd recursive call into a loop branch, but still actually recurse for the non-tail recursion.)
If you're just using 8
as a scale factor, I'm worried that you're using a multiply. It's much more efficient to use a shift by 3. Or (since you tagged this x86 as well as y86), maybe use a scaled-index addressing mode. But if it's for a pointer increment, then real x86 would use an add-immediate. Like add $8, %rsi
, using the add r/m64, imm8
encoding which only uses 1 byte for the constant (sign-extended to 64-bit).
But the x86 equivalent would be SIMD vector constants or floating point constants, because there aren't immediate forms of those. And in that case yes you do want set up the constant in a register outside of a loop.