Why are ufuncs 2x faster on one axis over the other?

I measured the performance of ufuncs like np.cumsum over different axes:

In [51]: arr = np.arange(int(1E6)).reshape(int(1E3), -1)

In [52]: %timeit arr.cumsum(axis=1)
2.27 ms ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [53]: %timeit arr.cumsum(axis=0)
4.16 ms ± 10.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

cumsum over axis 1 is almost 2x faster than over axis 0. What is going on behind the scenes?


  • You have a square array. It looks like this:

    1 2 3
    4 5 6
    7 8 9

    But computer memory is linearly addressed, so to the computer it looks like this:

    1 2 3 4 5 6 7 8 9

    Or, if the array is stored column by column instead of row by row, it looks like this:

    1 4 7 2 5 8 3 6 9

    If you are trying to sum [1 2 3] or [4 5 6] (one row), the first layout is faster. If you are trying to sum [1 4 7] or [2 5 8] (one column), the second layout is faster.
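
    To see the two layouts directly, here is a small sketch. It relies on ravel(order="K"), which reads elements in the order they sit in memory; the names a_c, a_f, mem_c and mem_f are just illustrative.

```python
import numpy as np

# The same 3x3 values stored in two different memory layouts.
a_c = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], order="C")  # row by row
a_f = np.asfortranarray(a_c)                                   # column by column

# order="K" walks the elements in the order they occur in memory.
mem_c = a_c.ravel(order="K").tolist()
mem_f = a_f.ravel(order="K").tolist()

print(mem_c)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(mem_f)  # [1, 4, 7, 2, 5, 8, 3, 6, 9]
```

    Both arrays compare equal element by element; only the storage order differs.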

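    This happens because data is loaded from memory one "cache line" at a time, which is typically 64 bytes. That is 8 values for an 8-byte dtype, such as NumPy's default float64 or the int64 that np.arange produces on most platforms. When the values you need sit next to each other, every cache line that gets loaded delivers 8 useful values. When they are far apart, each load delivers only one.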
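
    The arithmetic can be checked from the array's strides. This is a rough sketch: the 64-byte cache line is an assumption about typical hardware, dtype=np.int64 is set explicitly so the numbers do not depend on the platform, and how NumPy actually orders its inner loops may differ from this simple picture.

```python
import numpy as np

CACHE_LINE_BYTES = 64  # assumption: typical cache line size, not queried from the CPU

# Same shape as the array in the question, with an explicit 8-byte dtype.
arr = np.arange(int(1e6), dtype=np.int64).reshape(1000, -1)

# strides = bytes to step to the next element along each axis.
step_axis0, step_axis1 = arr.strides
values_per_line = CACHE_LINE_BYTES // arr.itemsize

print(arr.strides)      # (8000, 8)
print(values_per_line)  # 8

# Walking along axis 1 moves 8 bytes at a time, so consecutive values
# share a cache line. Walking along axis 0 moves 8000 bytes at a time,
# so every step lands in a different cache line.
```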

    You can control which layout NumPy uses when you construct an array, using the order parameter: order='C' (row by row, the default) or order='F' (column by column).
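
    A minimal sketch of choosing the layout, either at construction time or by copying an existing array with np.asfortranarray. The variable names are illustrative.

```python
import numpy as np

# Row by row (C order) is the default; order="F" asks for column by column.
c_arr = np.zeros((1000, 1000), order="C")
f_arr = np.zeros((1000, 1000), order="F")

# An existing array can be copied into the other layout.
arr = np.arange(int(1e6)).reshape(1000, -1)
arr_f = np.asfortranarray(arr)

# The flags report which layout an array actually uses.
print(c_arr.flags["C_CONTIGUOUS"], f_arr.flags["F_CONTIGUOUS"])
print(arr.strides, arr_f.strides)

# The layout changes how the values are stored, not what they are.
same = np.array_equal(arr.cumsum(axis=0), arr_f.cumsum(axis=0))
print(same)
```

    Repeating the %timeit measurements from the question on arr_f would be expected to favor axis=0 instead of axis=1, though the exact ratio depends on the machine.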

    For more on this, see: