I was wondering whether x**2 or x*x is faster
def sqr(x):
for i in range (20):
x = x**2
return x
def sqr_(x):
for i in range (20):
x = x*x
return x
When I time it, this is what I get:
The time it takes for x**2: 101230500
The time it takes for x*x: 201469200
I have tried it many many times, they are either equal, or x ** 2 is faster than x * x. But x*x is never faster than x**2.
So I disassembled the code:
For x**2:
5 12 LOAD_FAST 0 (x)
14 LOAD_CONST 2 (2)
16 BINARY_POWER
18 STORE_FAST 0 (x)
20 JUMP_ABSOLUTE 8
For x*x:
9 12 LOAD_FAST 0 (x)
14 LOAD_FAST 0 (x)
16 BINARY_MULTIPLY
18 STORE_FAST 0 (x)
20 JUMP_ABSOLUTE 8
Is it about load_const being slightly faster than load_fast?
LOAD_CONST: takes the literal value at index 1 of co_consts and pushes it
LOAD_FAST is accessing the value in an array by index
Or binary_power is faster than binary_multiply (I actually don't know the binary_power algorithm)?
For small integers, x*x
is significantly faster than x**2
since CPython does a lot more operation internally to compute a**b
. Actually, on my machine x*x
is 4 times faster (processor i5-9600KF, CPython 3.8.1, on Windows). That being said, in you code, numbers grows very quickly and Python integers are unbounded. In fact, each exponentiation cause the binary representation to be twice bigger. The exponents are multiplied together resulting in the computation of x**(2*2*2*...*2) = x**(2**20) = x**1048576
. For big x=2
, the number takes 128 KiB in memory and for x=100
it takes 850 KiB. This is pretty big. Each iteration of your loop is bounded by the computation of such huge numbers in memory. As a result, for large numbers, x*x
and x**2
are as fast because the same internal computation is done for both cases and the overhead of the CPython interpreter becomes negligible compared to the computation of the huge integers.
Internally, CPython appears to use _PyNumber_PowerNoMod
which calls PyNumber_Power
which calls ternary_op
, and PyNumber_Multiply
which calls binary_op1
. Note that CPython is not optimized to compute x**2
: internally CPython compute pow(x, 2, None)
which is the function to compute a modular exponentiation (though the call to pow
is a bit less efficient way as CPython has to check pow
has not been overwritten). This modular exponentiation function is much more expensive since it is a very generic function compared to x * x
.
In the end, it appears long_mul
and long_pow
are called in your case (note that long_pow
calls long_mul
internally so long_pow
actually need to compute more instructions).
For large numbers, CPython uses the Karatsuba multiplication (see: k_mul
).
Note that CPython is actually very slow in both cases since it takes several nanoseconds (at least on my machine) and performs dozens of checks and many function calls just to multiply two integers. This can be done natively in only 1 cycle for 64-bit integers on mainstream x86-64 processors. Large integers cannot be natively computed by mainstream processor and require a much more expensive computation.