Search code examples
cassemblyx86y86

Assembly optimizing function call, taking constants out of function allowed?


I have a recursive C function

foo (int numFruits) {
   ....
   // recurse at some point
  } 

inside the main function.

The corresponding assembly would look like this:

.pos 0x500
main:
   %r10  // contains numFruits
   call foo
   halt

.pos 0x4000
foo: // recursive
  irmovq $8, %r13 // load value 8 into %r13
  ...

Inside foo, I am using constant value for size of quad which is 8 bytes long. (value 8 is not present in C code, but I am using this value to turn length of array to corresponding address, etc...)

If I load this value every time when foo is called recursively, I think it's wasting cycles. I was wondering whether compiler is able to optimize it such that constants are loaded before calling foo in the main?

Example: loading value 8 into r13 once before calling foo, so that this doesn't have to be loaded every single time. (provided that r13 is restored to its original state before loading value 8, after hitting halt)

If I were to save value 8 into r13 before main, would this be still preserving the spirit of foo(int numFruits) or is my change equivalent to foo(int numFruits, int quadSize)?

Thank you


Solution

  • It's equivalent to foo(int numFruits, long quadSize). Well maybe int quadSize if your y86 ABI has 64-bit int. All of the normal x86-64 ABIs have 32-bit int, and Windows x64 even has 32-bit long.

    You also tagged this x86. x86-64 can move 8 into a 64-bit register with a 5-byte instruction like mov $8, %r13d: 1 opcode byte + imm32. (Actually 6 bytes including the REX prefix). You only need mov r64, imm64 for constants that don't fit in a zero or sign-extended 32-bit immediate. Writing a 32-bit register zero-extends in to the full 64-bit register. You can even code-golf constant setup even more, at the cost of speed. Like push $imm8 / pop %r13 in 3 bytes (actually 4 for a REX prefix). When optimizing for code size, you want to avoid r8..r15. https://codegolf.stackexchange.com/questions/132981/tips-for-golfing-in-x86-x64-machine-code/132985#132985.

    I have no idea if y86 has efficient machine-code encodings for small constants.

    There are no physical y86 CPUs in existence that I know of. There are emulators, but I'm not sure if there are even any virtual designs (like verilog) of y86 hardware that could be simulated in a cycle-accurate simulator.

    So any talk of "saving cycles" is a stretch for y86. Real x86-64 CPUs are pipelined superscalar with out-of-order-execution, and often not bottlenecked on code-fetch. Especially in modern CPUs with a uop cache. Depending on the loop, extra mov-immediate instructions off the critical path might not slow things down any. https://agner.org/optimize/, and see performance links in the x86 tag wiki.


    But yes, you should generally hoist constant-setup out of loops.

    If your "loop" is recursion that you can't easily optimize into a normal loop without an expensive call / ret, you can certainly make a wrapper function for public usage, and have it fall through into a private function that effectively uses a custom calling convention (which assumes that %r13 = 8).

    .globl foo
    foo:
        irmovq  $8, %r13
    
     # .p2align 4    # optional 16-byte alignment for the recursion entry point
     # fall through
    
    .Lprivate_foo:
       # only reachable with r13=8
     # blah blah using r13=8
        call .Lprivate_foo
     # blah blah still assuming r13=8
        call .Lprivate_foo
     # more stuff
        ret                      # the final return 
    

    Nothing else can call private_foo; it's a local label (.Lxxx) only visible from this source. So the body of .Lprivate_foo can assume that R13 = 8.

    If r13 is a call-preserved register in your y86 calling convention (like it is in x86-64 System V), then either pick a call-clobbered register like r11, or have the public wrapper function call private_foo so it can restore the caller's r13 before returning. Using a register that functions are normally allowed to clobber makes this near-zero extra overhead instead of introducing an extra level of call/ret.

    But that only works if you don't call any other unknown functions from inside your recursive function, otherwise you have to assume they clobbered R11.

    Optimizing recursion into a loop has a big advantage, and compilers do that whenever possible. (In a double-recursive function like a tree traversal, they will often turn the 2nd recursive call into a loop branch, but still actually recurse for the non-tail recursion.)


    If you're just using 8 as a scale factor, I'm worried that you're using a multiply. It's much more efficient to use a shift by 3. Or (since you tagged this x86 as well as y86), maybe use a scaled-index addressing mode. But if it's for a pointer increment, then real x86 would use an add-immediate. Like add $8, %rsi, using the add r/m64, imm8 encoding which only uses 1 byte for the constant (sign-extended to 64-bit).

    But the x86 equivalent would be SIMD vector constants or floating point constants, because there aren't immediate forms of those. And in that case yes you do want set up the constant in a register outside of a loop.