Tags: function, assembly, x86-64, calling-convention

How the number of function arguments affects performance


When coding a function, I usually recall the "Clean Code" principle of

A function shouldn’t have more than 3 arguments.

However, given the x86-64 calling conventions below, I've relaxed it to 4 arguments. That is the largest number of integer or pointer arguments that both conventions pass entirely in CPU registers, which avoids stack operations that are slower than register access.

System V AMD64 ABI

Integer/Pointer Arguments 1-6:  RDI, RSI, RDX, RCX, R8, R9
Floating Point Arguments 1-8:   XMM0 - XMM7
Excess Arguments:               Stack

Microsoft x64 calling convention

Integer/Pointer Arguments 1-4:  RCX, RDX, R8, R9
Floating Point Arguments 1-4:   XMM0 - XMM3
Excess Arguments:               Stack

Example

Limiting the number of arguments to 4 ensures that both Windows and Linux pass them in registers instead of on the stack.

#include <stdio.h>

int Add(int a, int b, int c, int d) { return a + b + c + d; }

int main() {
    int sum = Add(1,2,3,4);
    return 0;
}

Linux Disassembly

mov     ecx, 4
mov     edx, 3
mov     esi, 2
mov     edi, 1
call    Add

Windows Disassembly

mov     r9d,4  
mov     r8d,3  
mov     edx,2  
mov     ecx,1  
call    Add
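
For contrast, here is a rough sketch of where a fifth or seventh integer argument would go, based only on the register tables above. `AddFive` and `AddSeven` are hypothetical names, and the comments describe the two conventions rather than actual compiler output.

```c
/* Hypothetical helpers; the comments restate the register tables above. */

/* System V AMD64: a..e travel in EDI, ESI, EDX, ECX, R8D (all registers).
   Microsoft x64:  a..d travel in ECX, EDX, R8D, R9D; e goes on the stack. */
int AddFive(int a, int b, int c, int d, int e) {
    return a + b + c + d + e;
}

/* System V AMD64: a..f travel in registers; g goes on the stack.
   Microsoft x64:  a..d travel in registers; e, f and g go on the stack. */
int AddSeven(int a, int b, int c, int d, int e, int f, int g) {
    return a + b + c + d + e + f + g;
}
```

So a 5-argument function would already touch the stack on Windows, while on Linux that only starts with the seventh integer argument.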

Question

When writing cross-platform functions, is this micro-optimization a viable new mantra?

A function shouldn’t have more than 4 arguments.

Note

The specific use case is assembling code with UASM (which understands MASM syntax) on Windows and Linux.

For compilers (msvc, gcc), the optimizer handles performance; for assemblers (masm, nasm, uasm), however, there are no such tools.

The 4-argument mantra was chosen for performance (not as a coding style), so that the generated code starts from an optimized state.


Solution

  • The "clean code" principle you're talking about most likely refers to exactly that: code tidiness and readability. Many parameters usually means long lines and/or parameter lists broken into multiple lines (just look at Win32 functions with 10 arguments).

    However, since it seems that you are asking this from a strictly performance perspective, it doesn't matter how many parameters a function has. Rather than reducing the number of arguments, you should try to reduce the number of calls. A good way to do this is to write reasonably short functions which the compiler can inline.
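
    A minimal sketch of what "short functions which the compiler can inline" might look like in C. The names `add4` and `sum_of_batches` are hypothetical, and whether the calls are really inlined depends on the compiler and its optimization settings.

    ```c
    /* Hypothetical example: a short helper the compiler is free to inline. */
    static inline int add4(int a, int b, int c, int d) {
        return a + b + c + d;
    }

    /* With optimization enabled, a compiler can typically fold both add4
       calls into this function's body, so no call/ret pair and no argument
       passing remains for them. Inlining is a hint, never a guarantee. */
    int sum_of_batches(const int *v) {
        return add4(v[0], v[1], v[2], v[3]) + add4(v[4], v[5], v[6], v[7]);
    }
    ```

    Once a call is inlined, the question of which register or stack slot each argument would have used no longer applies to it.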

    Let's suppose that instead of 4 numbers, my code needs to add a batch of 4 and a batch of 5. With a 4 parameter function, I will need to do this:

    #include <stdio.h>
    
    int Add(int a, int b, int c, int d) { return a + b + c + d; }
    
    int main() {
        int sum1 = Add(1,2,3,4);
        int sum2 = Add(1,2,3,Add(4,5,0,0));
        return 0;
    }
    

    Now let's say I use a 5-parameter function:

    #include <stdio.h>
    
    int Add(int a, int b, int c, int d, int e) { return a + b + c + d + e; }
    
    int main() {
        int sum1 = Add(1,2,3,4,0);
        int sum2 = Add(1,2,3,4,5);
        return 0;
    }
    

    As you can see, the second version makes only 2 calls (one per sum), while the first version makes 3. So the 4-argument version trades roughly 4 mov stack accesses in the 5-argument version (for the fifth argument, which the Microsoft convention passes on the stack) for a whole extra call/ret pair. That pair, mind you, also makes 2 stack accesses for the return address, and it may introduce additional overhead such as shadow stack accesses or canary values placed in the new function's stack frame.
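
    To make the call counts concrete, here is a small instrumented sketch of the two versions. The counters and the names `Add4Counted`, `Add5Counted`, `batches_with_add4` and `batches_with_add5` are hypothetical additions for illustration; they only count calls and do not measure time.

    ```c
    /* Hypothetical instrumentation: count how often each variant is called. */
    static int calls4 = 0;
    static int calls5 = 0;

    int Add4Counted(int a, int b, int c, int d) {
        ++calls4;
        return a + b + c + d;
    }

    int Add5Counted(int a, int b, int c, int d, int e) {
        ++calls5;
        return a + b + c + d + e;
    }

    /* Sum a batch of 4 and a batch of 5 using only the 4-argument function. */
    int batches_with_add4(void) {
        int sum1 = Add4Counted(1, 2, 3, 4);
        int sum2 = Add4Counted(1, 2, 3, Add4Counted(4, 5, 0, 0)); /* nested call */
        return sum1 + sum2;
    }

    /* Same work with the 5-argument function: one call per batch. */
    int batches_with_add5(void) {
        int sum1 = Add5Counted(1, 2, 3, 4, 0);
        int sum2 = Add5Counted(1, 2, 3, 4, 5);
        return sum1 + sum2;
    }
    ```

    Both functions compute the same total, but after running each once `calls4` is 3 and `calls5` is 2.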

    Of course, the Add function in this example is trivial enough for the compiler to inline. (In fact it goes further: because the sums are never used and Add has no side effects, the compiler generates an empty main.) But with a realistic, more complex function, it will almost always be better to call it fewer times.

    This is, however, just the tip of the iceberg when it comes to performance. If you are worried about performance, get a profiler and find the bottlenecks in your code. If the time goes into one badly written loop, hand-optimize only that loop. Leave most of the optimization to the compiler, because it knows what it's doing and has been refined for decades.

    Your requirements are also somewhat contradictory: you say you are writing cross-platform functions, yet the specific use case is assembling code with UASM?