To the best of my knowledge, in low-level languages such as C, it is generally advisable to keep the number of function arguments to 6 or fewer, since then there is no need to pass arguments on the stack (i.e., there are enough registers), and sometimes no need to even create a stack frame for the function.
Does this logic still apply in Python? Or is there any transformation done on function arguments, when the interpreter is called, that makes this point irrelevant/moot?
I'm well aware that, realistically, the performance gains are negligible if they exist at all (and for this type of optimization it's best to just switch to Cython, or something else altogether), but I would like to understand Python better.
On Python 3.8.10 (on an x86-64 machine, Ubuntu 20.04), I tried using dis.dis() to look at the bytecode disassembly of a minimal example:
import random
def foo(a, b, c, d, e, f, g):
    return a+b+c+d+e+f+g
a = random.randint(0, 10)
b = random.randint(0, 10)
c = random.randint(0, 10)
d = random.randint(0, 10)
e = random.randint(0, 10)
f = random.randint(0, 10)
g = random.randint(0, 10)
foo(a, b, c, d, e, f, g)
(using random just to make sure there aren't any optimisation shenanigans).
This resulted in this bytecode for the last line of the code (trimmed for brevity):
...
12 100 LOAD_NAME 1 (foo)
102 LOAD_NAME 3 (a)
104 LOAD_NAME 4 (b)
106 LOAD_NAME 5 (c)
108 LOAD_NAME 6 (d)
110 LOAD_NAME 7 (e)
112 LOAD_NAME 8 (f)
114 LOAD_NAME 9 (g)
116 CALL_FUNCTION 7
118 POP_TOP
120 LOAD_CONST 1 (None)
122 RETURN_VALUE
...
However, I'm not familiar with the bytecode, specifically LOAD_NAME, so I can't tell whether there is any internal logic that separates loading into registers from loading onto the stack.
No.
Not even close.
Not really
Python code is a high-level abstraction, very far detached from the actual underlying architecture.
The arguments for a function call are collected one at a time, each in a couple of Python bytecode instructions, and each bytecode instruction executes the equivalent of at least tens, but typically hundreds, of C-level instructions.
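As a rough illustration of how the arguments are collected, the sketch below compiles the call expression from the question and lists its bytecode. It only inspects the bytecode, so `foo` and `a` through `g` do not need to exist. The exact opcode names vary between CPython versions (for example, `CALL_FUNCTION` on 3.8 and `CALL` on newer releases), so treat the output as version-dependent.

```python
import dis

# Compile the call expression from the question without running it,
# so foo and a..g do not have to be defined.
code = compile("foo(a, b, c, d, e, f, g)", "<example>", "eval")

instructions = list(dis.get_instructions(code))

# Every name is pushed onto the interpreter's own value stack, one
# LOAD_NAME per name; nothing here maps to a CPU register.
loaded_names = [ins.argval for ins in instructions if ins.opname == "LOAD_NAME"]

# The call opcode is CALL_FUNCTION on 3.8 and CALL on newer versions.
call_ops = [ins.opname for ins in instructions if ins.opname.startswith("CALL")]

for ins in instructions:
    print(ins.offset, ins.opname, ins.argrepr)
```

The function and all seven arguments are loaded the same way, regardless of how many there are; the call instruction then consumes them from the interpreter's value stack.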
Moreover, most calls will even build a temporary tuple object which will be de-structured again (though there are likely optimizations in place to avoid that in pure Python-to-Python calls nowadays).
That said, even when coding in C, that level of parameter optimization is nonsense: a shallow stack would likely sit in the L1 CPU cache and make no difference compared to passing all parameters in registers on a modern desktop- or notebook-class CPU.
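For anyone who wants to check the Python side empirically, here is a minimal timing sketch. The functions `six` and `seven` are hypothetical stand-ins, not from the question, and the numbers will vary by machine and interpreter version. The point is only that if the six-register threshold mattered at the Python level, you would expect a clear step between the two timings rather than a small, roughly proportional increase.

```python
import timeit

# Hypothetical helpers: identical bodies, differing only in arity.
def six(a, b, c, d, e, f):
    return a

def seven(a, b, c, d, e, f, g):
    return a

NUMBER = 100_000  # calls per timing run

# Take the best of several repeats to reduce noise from other processes.
t6 = min(timeit.repeat("six(1, 2, 3, 4, 5, 6)",
                       globals={"six": six}, number=NUMBER, repeat=5))
t7 = min(timeit.repeat("seven(1, 2, 3, 4, 5, 6, 7)",
                       globals={"seven": seven}, number=NUMBER, repeat=5))

print(f"6 args: {t6 / NUMBER * 1e9:.1f} ns per call")
print(f"7 args: {t7 / NUMBER * 1e9:.1f} ns per call")
```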