Tags: python, performance, complex-numbers

Why is complex multiplication nearly as fast as integer multiplication in Python?


I was under the impression that complex multiplication takes longer than real multiplication, since it requires several real multiplications: four in the schoolbook formula, or three with Gauss's trick.
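
For reference, a sketch of the two formulas (the second, Gauss's trick, trades one multiplication for extra additions):

def mul4(a, b, c, d):
    # Schoolbook: (a + bi)(c + di) = (ac - bd) + (ad + bc)i, four multiplications
    return a*c - b*d, a*d + b*c

def mul3(a, b, c, d):
    # Gauss's trick: three multiplications, five additions/subtractions
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return k1 - k3, k1 + k2

print(mul4(3, 4, 5, 6), mul3(3, 4, 5, 6))  # both print (-9, 38)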

However I tried the following:

a, b = 3, 4
c, d = 5, 6
print(a*c - b*d, a*d + b*c)  # complex product via four real multiplications

e = 3+4j
f = 5+6j
print(e*f)  # the same product using Python's built-in complex type

%timeit a*c - b*d
%timeit a*d + b*c
%timeit e*f
%timeit a*b

and got

-9 38
(-9+38j)
110 ns ± 0.2 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
102 ns ± 0.193 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
47.2 ns ± 0.942 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
43 ns ± 1.52 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Why is complex multiplication nearly as fast as real multiplication?

I understand that Python is a high-level language, but these results still confuse me.

The only explanation I can think of is that both computations are so fast that the bottleneck is not the arithmetic itself but something else...
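
One way to check this (a minimal sketch): both expressions compile to the same bytecode, so the interpreter's dispatch work is identical in either case; only the innermost type-specific multiply differs.

import dis

dis.dis("a*b")  # LOAD_NAME a, LOAD_NAME b, BINARY_OP * (BINARY_MULTIPLY on older Pythons)
dis.dis("e*f")  # identical opcode sequence; only the runtime operand types differ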


Here is the original question I wanted to ask:


I have this version of Discrete Fourier Transform that uses complex operations:

from math import pi, cos, sin

def W(n, N):
    # Twiddle factor e^(-2*pi*i*n/N), built from cos and sin
    # (equivalent to cmath.exp(-2j * pi * n / N))
    theta = -2 * pi * n / N
    return cos(theta) + 1j*sin(theta)

def dft(signal):
    N = len(signal)
    total = []
    for k in range(N):
        s = 0
        for n in range(N):
            s += signal[n] * W(n*k, N)  # one complex multiply-accumulate per term
        total.append(s)
    return total
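
As a quick sanity check (a sketch assuming NumPy is available), the output matches numpy.fft.fft, which uses the same e^(-2*pi*i*k*n/N) convention:

import numpy as np

x = [1.0, 2.0, 3.0, 4.0]
print(np.allclose(dft(x), np.fft.fft(x)))  # True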

I also have this version of the DFT that uses no complex operations; instead it works on two real arrays, one holding the real part of the signal and one the imaginary part.

def Wri(n, N):
    # Same twiddle factor as W, returned as a (real, imaginary) pair
    theta = -2 * pi * n / N
    return cos(theta), sin(theta)

def dft_ri(signal_r, signal_i):
    N = len(signal_r)
    tr, ti = [], []
    for k in range(N):
        sr = 0
        si = 0
        for n in range(N):
            a = signal_r[n]
            b = signal_i[n]
            c, d = Wri(n*k, N)
            # (a + bi)(c + di), expanded into real arithmetic
            sr += a*c - b*d
            si += a*d + b*c
        tr.append(sr)
        ti.append(si)
    return tr, ti

Using the following complex signal:

import numpy as np

T = 2 * np.pi  # period; not defined in the original snippet, assumed here so the code runs

x = np.linspace(0, T, 1024)

def signal2(t):
    # Ramp from 0 to pi in both parts while t % T < pi, flat at pi afterwards
    return np.where(np.logical_and(t % T >= 0, t % T < np.pi), t % T + (t % T)*1j, np.pi + np.pi*1j)

s = list(signal2(x))
sr = list(np.real(s))
si = list(np.imag(s))
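
As a consistency check on this signal (a small sketch using the definitions above), the two implementations agree up to rounding:

res = dft(s)
res_r, res_i = dft_ri(sr, si)
# Largest difference between the complex and the split-real results
print(max(abs(c - complex(r, i)) for c, r, i in zip(res, res_r, res_i)))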

The two W functions take roughly the same amount of time:

%timeit W(8, 3)
%timeit Wri(8, 3)

outputs

464 ns ± 0.756 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
424 ns ± 2.21 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

However

%timeit dft(s)
%timeit dft_ri(sr, si)

outputs

935 ms ± 4.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.6 s ± 3.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Question

Why does the DFT without complex operations take nearly twice as long as the DFT with complex operations?


Solution

  • Your timings are almost 100% interpreter overhead: name lookups, object boxing, and dispatch. The actual hardware-level multiplication is only a tiny fraction of the cost either way.

    Try it with NumPy arrays, which have far lower per-element overhead, and you'll see a different story:

    In [1]: import numpy
    
    In [2]: x = numpy.ones(10000)
    
    In [3]: y = x.astype(complex)
    
    In [4]: %timeit x*x
    4.2 µs ± 51.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    In [5]: %timeit y*y
    16.6 µs ± 696 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    In [6]: z = x.astype(int)
    
    In [7]: %timeit z*z
    6.54 µs ± 1 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    That's a per-element time in the vicinity of 1 nanosecond, while your timings above are more like 40 to 100 nanoseconds per operation.


    The performance differences you're seeing between your two DFT implementations are likewise due to overhead. Very little of your runtime is actually spent doing math.
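
    For scale, here is a sketch (an illustration, not the asker's code) of a fully vectorized DFT: moving both loops into NumPy removes nearly all of that per-element overhead.

    import numpy as np

    def dft_np(signal):
        # Build the N x N DFT matrix W[k, n] = e^(-2*pi*i*k*n/N)
        # and apply it in a single matrix-vector product.
        x = np.asarray(signal, dtype=complex)
        n = np.arange(len(x))
        W = np.exp(-2j * np.pi * np.outer(n, n) / len(x))
        return W @ x

    This is still O(N^2) work (and O(N^2) memory for the matrix); numpy.fft.fft computes the same result in O(N log N) time and is faster still.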