python · numpy · performance

Why is complex conjugation so slow?


It's on par with complex multiplication, which boggles my mind:

import numpy as np

def op0(x):
    return np.conj(x)
    
def op1(x0, x1):
    return x0 * x1
    
def op2(x0, x1):
    x0[:] = x1

for N in (50, 500, 5000):
    print(f"\nshape = ({N}, {N})")
    x0 = np.random.randn(N, N) + 1j*np.random.randn(N, N)
    x1 = np.random.randn(N, N) + 1j*np.random.randn(N, N)

    %timeit op0(x0)
    %timeit op1(x0, x1)
    %timeit op2(x0, x1)
shape = (50, 50)
3.55 µs ± 143 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
4.85 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.85 µs ± 116 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

shape = (500, 500)
1.52 ms ± 60.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.96 ms ± 133 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
299 µs ± 50.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

shape = (5000, 5000)
163 ms ± 4.4 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
185 ms ± 11.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
39.8 ms ± 399 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Why is flipping the sign of x.imag so expensive? Surely, at a low level, it's much easier than the several multiplications and additions of (a + j*b)*(c + j*d)?

Windows 10 x64, numpy 1.23.5, Python 3.10.4


Solution

  • Here is an answer based on the many previous comments:

    "Unsure if I should be impressed with how fast complex multiplication is or disappointed with how slow rudimentary array stuff is."

    Memory accesses are slow, and they will get relatively slower in the future because of the memory wall (a trend that started more than 20 years ago). If you want something fast, you need to avoid going through DRAM. NumPy is not specifically designed for that (which is quite sad). Your hardware reaches 17.2 GiB/s. This is not a lot, but not very bad either. Modern PCs can reach 40-60 GiB/s (some, like the M1, can even exceed 60 GiB/s). The L3 cache of a modern Intel CPU can reach ~400 GiB/s, so a lot more.
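
    As a rough sanity check, a figure of that kind can be estimated from the copy timing in the question (op2 at N = 5000). The sketch below is only illustrative: whether you also count the extra read caused by the write-allocate policy (discussed further down) changes the result, and the 17.2 GiB/s quoted above was presumably derived from timings not shown here.

    import numpy as np

    N = 5000
    bytes_per_elem = np.dtype(np.complex128).itemsize  # 16 bytes for complex128
    array_bytes = N * N * bytes_per_elem                # ~381 MiB per array

    copy_time = 39.8e-3  # seconds, the op2 timing reported in the question

    # Naive model: one full read of x1 plus one full write of x0.
    naive_traffic = 2 * array_bytes
    # Write-allocate model: the destination lines are also read before being written.
    wa_traffic = 3 * array_bytes

    print(f"naive estimate:            {naive_traffic / copy_time / 2**30:.1f} GiB/s")
    print(f"with write-allocate reads: {wa_traffic / copy_time / 2**30:.1f} GiB/s")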

    GPUs often have a significantly higher memory throughput, but the ratio computational_speed / memory_bandwidth is still high, as on CPUs (often even higher, actually). CPUs have pretty big caches nowadays, while GPUs often do not. Note that GPU computations can be deferred by some APIs (lazy computation), so you need to account for that in benchmarks (you can print the value, for example).
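
    For instance, with CuPy (used here purely as an illustration; other GPU array libraries behave similarly), the device should be synchronized before the timer is stopped, otherwise only the kernel launch is measured:

    import time
    import cupy as cp  # assumption: a CuPy-based benchmark

    x = cp.random.randn(5000, 5000) + 1j * cp.random.randn(5000, 5000)

    start = time.perf_counter()
    y = cp.conj(x)                  # the kernel launch returns almost immediately
    cp.cuda.Device().synchronize()  # wait until the GPU has actually finished
    elapsed = time.perf_counter() - start
    print(f"conj on GPU: {elapsed * 1e3:.2f} ms")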

    "op5 vs op6: extra 2*S reads, 4*S multiplies, 4*S adds don't even double compute time. So writes are the most expensive?"

    All operations should be memory bound here; only the memory operations matter.

    op5 is slower than op6 because op5 reads two arrays while op6 reads one. More data needs to be transferred from DRAM, so it takes more time. Besides, writes can be more expensive than reads, but this depends on the compiled assembly code (so on the compiler, the optimization flags and the actual source code) and on the target architecture. Memory performance is a complex topic, much more so than it seems (see below for more information).

    Note that it does not matter that these are two separate arrays: one big array virtually split into two parts would have the same impact, since there is not much difference from the hardware's point of view.
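
    A minimal sketch to check that claim (assuming an IPython session, like the timings in the question): multiplying two separate arrays should take about the same time as multiplying two views that split one allocation twice as large, since the total memory traffic is identical.

    import numpy as np

    N = 5000
    # Two separate arrays.
    a = np.random.randn(N, N) + 1j * np.random.randn(N, N)
    b = np.random.randn(N, N) + 1j * np.random.randn(N, N)

    # One array of twice the size, virtually split into two halves (views, no copy).
    big = np.random.randn(2 * N, N) + 1j * np.random.randn(2 * N, N)
    top, bottom = big[:N], big[N:]

    %timeit a * b            # two separate arrays
    %timeit top * bottom     # two views of the same allocation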


    General notes

    Regarding the timings: modern memory and CPUs are pretty complex, so it is generally not easy to understand what is going on from a benchmark alone (especially without the assembly code). Writes and reads are almost equally fast from the DRAM's perspective, but mixing them reduces performance because of the way DRAMs work. Moreover, modern x86-64 CPU caches use a write-allocate policy, which causes a read to be performed for each cache line being written; as a result, writes can be about 2 times slower than reads. That being said, non-temporal store instructions can be used to avoid this issue, assuming the compiler generates them.
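
    One way to glimpse this asymmetry from Python (a rough sketch; the exact ratio depends on the CPU and on whether non-temporal stores end up being used) is to compare a read-only reduction with a copy over the same data:

    import numpy as np

    x = np.random.randn(5000, 5000) + 1j * np.random.randn(5000, 5000)
    out = np.empty_like(x)

    %timeit x.sum()     # reads ~381 MiB, writes almost nothing
    %timeit out[:] = x  # reads ~381 MiB and writes ~381 MiB (plus the write-allocate read)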

    Compilers often do not generate non-temporal instructions because they can be much slower when the array fits in the cache, and in this case the compiler used to build NumPy cannot know the size of the arrays (they are defined at runtime). It cannot know the size of the CPU cache either. Memory copies tend to use the basic memcpy function, which is optimized to use non-temporal instructions for large arrays depending on the target platform. AFAIK, such instructions are not used in NumPy for operations like multiplication/addition, etc. (at least not on my machine).
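
    Purely as an illustration (the outcome depends on the NumPy build and the CPU), one can compare np.copyto, which for contiguous same-dtype arrays typically reduces to a memcpy-like routine, with an elementwise ufunc writing the same amount of data into a preallocated output:

    import numpy as np

    src = np.random.randn(5000, 5000) + 1j * np.random.randn(5000, 5000)
    dst = np.empty_like(src)

    %timeit np.copyto(dst, src)             # memcpy-style copy; may use non-temporal stores
    %timeit np.multiply(src, 1.0, out=dst)  # ufunc loop writing the same amount of data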

    Note that I mention "modern CPUs" because pre-Zen AMD CPUs use a very different approach, and other CPU architectures like POWER also behave very differently in this case. Every detail matters when it comes to high performance, which is also why I said the topic is complex. The best approach is to do low-level profiling on your specific machine, or to list your hardware and the exact NumPy package (or the assembly code used), so as to avoid having to consider all the things that could theoretically happen.
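
    As a side note related to the original question: since np.conjugate is a ufunc, the output allocation that np.conj(x) performs on every call can be avoided with the out= argument, which makes it easier to see how much of the measured time is memory traffic versus allocation.

    import numpy as np

    x = np.random.randn(5000, 5000) + 1j * np.random.randn(5000, 5000)
    out = np.empty_like(x)

    %timeit np.conj(x)                # allocates a fresh output array on every call
    %timeit np.conjugate(x, out=out)  # same computation, preallocated output
    %timeit np.conjugate(x, out=x)    # fully in place: reads and writes a single buffer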