Tags: python, numpy, optimization, benchmarking, einsum

Numpy einsum getting crazy slow after certain problem size


I have the following benchmarking script

import numpy as np
import timeit
import matplotlib.pyplot as plt

n1, n2, n3, n4, n5, m = (101, 33, 1, 32, 2, 32)


def make_arrays(aOrder, bOrder, cOrder):
    a = np.random.randn(n1, m) + 1j * np.random.randn(n1, m)
    b = np.random.randn(n2, m) + 1j * np.random.randn(n2, m)
    c = np.random.randn(n1, n2, n3, n4, n5) + 1j * np.random.randn(n1, n2, n3, n4, n5)
    return (
        np.array(a, order=aOrder),
        np.array(b, order=bOrder),
        np.array(c, order=cOrder),
    )


# used in B()
blockSize = 84

resA = []
resB = []
resC = []

sizes = np.unique(np.exp(np.linspace(2, 6, 8)).astype(np.int64))
numTrials = 10

# overrides m from above
for m in sizes:
    a, b, c = make_arrays("F", "F", "F")

    # note: this path is computed but never used below
    path = np.einsum_path(
        a,
        [0, 5],
        b,
        [1, 5],
        c,
        [0, 1, Ellipsis],
        [Ellipsis, 5],
        optimize="greedy",
    )

    def A():
        np.einsum(
            a,
            [0, 5],
            b,
            [1, 5],
            c,
            [0, 1, 2, 3, 4],
            [5, 2, 3, 4],
            optimize="greedy",
            order="F",
        )
        # print("einsum\n", res.flags)

    def B():
        numBlocks = int(a.shape[1] // blockSize)
        np.concatenate(
            tuple(
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, kk * blockSize : (kk + 1) * blockSize],
                    [1, 0],
                    b[:, kk * blockSize : (kk + 1) * blockSize],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                )
                for kk in range(numBlocks)
            )
            + (
                np.einsum(
                    c,
                    [1, 2, Ellipsis],
                    a[:, numBlocks * blockSize :],
                    [1, 0],
                    b[:, numBlocks * blockSize :],
                    [2, 0],
                    [0, Ellipsis],
                    optimize="greedy",
                    order="F",
                ),
            ),
            axis=0,
        )

    def C():
        tmp = np.einsum(a, [0, 5], c, [0, 1, 2, 3, 4], [1, 2, 3, 4, 5], order="F")
        np.einsum(b, [1, 5], tmp, [1, 2, 3, 4, 5], [2, 3, 4, 5], order="F")

    A()
    B()
    C()

    measured = np.mean(timeit.repeat(A, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resA.append(measured)

    measured = np.mean(timeit.repeat(B, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resB.append(measured)

    measured = np.median(timeit.repeat(C, number=numTrials, repeat=numTrials)) / (
        numTrials * m
    )
    resC.append(measured)

plt.figure()
plt.semilogy(sizes, resA, label="A")
plt.semilogy(sizes, resB, label="B")
plt.semilogy(sizes, resC, label="C")

plt.legend()
plt.show()

I think one only needs numpy and matplotlib to run it.

Approach A is my naive way of using einsum, since I expected it to work well. After this code had been sitting in my codebase for a while, I noticed that beyond a certain size of m the computation gets very slow. Hence I implemented B, which produces satisfactory results across the board but loses against A for small m. Also, no matter how I toss and turn the memory layout of the input and output arrays, I see no noticeable difference in the qualitative behavior.

Just to retain my sanity, I also tried an even more naive variant, C, which is very slow, as expected.

On a very powerful AMD EPYC 7343 16-core processor with numpy from Intel Python, i.e. using MKL, where I have forced MKL to use only one core for the computation, I get the following result:

[plot: runtime per slice vs. m for A, B and C on the EPYC/MKL setup]

Essentially, I divide the runtime by the problem size m to get an estimate of the cost of a single "slice" of the problem.
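One common way to force single-threaded MKL, by the way (there are several mechanisms; this is just one), is to cap its thread pool before the first numpy import:

import os
os.environ["MKL_NUM_THREADS"] = "1"  # must be set before numpy is imported
import numpy as np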

To rule out any relation to the CPU or MKL, I used my Macbook Air with the M2 chip, first with a vanilla numpy:

[plot: the same benchmark on the M2 with vanilla numpy]

and then with a numpy linked against Apple's Accelerate BLAS library, to make use of the GPU or whatever, I don't really care. Then I get

[plot: the same benchmark on the M2 with Accelerate-linked numpy]

So my question is: what the effing heck is wrong with A?

As a kind of off-topic sidenote: I also run the same code on cupy from time to time, and there this behavior is not visible. The corresponding version of A scales nicely, at least for the exact same problem sizes as used above.
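That cupy version is essentially the following sketch (the string-subscript form of the same contraction; the *_gpu names are illustrative):

import cupy as cp

# move the operands to the GPU and run A's contraction there
a_gpu, b_gpu, c_gpu = map(cp.asarray, (a, b, c))
res_gpu = cp.einsum("af,bf,abcde->fcde", a_gpu, b_gpu, c_gpu)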

Another off-topic sidenote: if I use opt_einsum for the contraction, basically with the code of A, I get something similar to B from above, but slightly slower.
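Roughly like this (a sketch; opt_einsum.contract also accepts np.einsum's interleaved operand form):

import opt_einsum as oe

def A_opt():
    # same call as A(), routed through opt_einsum
    return oe.contract(
        a, [0, 5],
        b, [1, 5],
        c, [0, 1, 2, 3, 4],
        [5, 2, 3, 4],
    )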


Solution

  • Just to help picture the action, the first try, with m=100, is

    In [15]: np.einsum(
        ...:     a,
        ...:     [0, 5],
        ...:     b,
        ...:     [1, 5],
        ...:     c,
        ...:     [0, 1, 2, 3, 4],
        ...:     [5, 2, 3, 4],
        ...:     # optimize="greedy",
        ...: ).shape
    Out[15]: (100, 1, 32, 2)
    

    einsum_path with the default greedy optimization shows the same contraction:

    In [17]:         print(np.einsum_path(
        ...:             a,
        ...:             [0, 5],
        ...:             b,
        ...:             [1, 5],
        ...:             c,
        ...:             [0, 1, 2, 3, 4],
        ...:             [5, 2, 3, 4],
        ...:             optimize="greedy",
        ...:         )[1])
      Complete contraction:  af,bf,abcde->fcde
             Naive scaling:  6
         Optimized scaling:  6
          Naive FLOP count:  6.399e+07
      Optimized FLOP count:  4.308e+07
       Theoretical speedup:  1.485
      Largest intermediate:  2.112e+05 elements
    --------------------------------------------------------------------------
    scaling                  current                                remaining
    --------------------------------------------------------------------------
       6             abcde,af->cedbf                           bf,cedbf->fcde
       5              cedbf,bf->fcde                               fcde->fcde
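
    Spelled out, that greedy path does two binary contractions, roughly (my transcription of the path above into explicit einsum calls):

    tmp = np.einsum("abcde,af->cedbf", c, a)   # step 1: sum over a
    out = np.einsum("cedbf,bf->fcde", tmp, b)  # step 2: sum over b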
    

    With optimize=False (none):

      Complete contraction:  af,bf,abcde->fcde
             Naive scaling:  6
         Optimized scaling:  6
          Naive FLOP count:  6.399e+07
      Optimized FLOP count:  6.399e+07
       Theoretical speedup:  1.000
      Largest intermediate:  6.400e+03 elements
    --------------------------------------------------------------------------
    scaling                  current                                remaining
    --------------------------------------------------------------------------
       6           abcde,bf,af->fcde                               fcde->fcde
    

    I picture this as a double matmul with `f` as the batch dimension, `a` and `b` the two sum-of-product dimensions, and `cde` reducible to one dimension ('going along for the ride').

    With optimize=False, the timeit is

    242 ms ± 941 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    with greedy:

    21.5 ms ± 743 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    The equivalent double matmul times about the same:

    In [41]: timeit res = b.T[:,None,:] @ (a.T @ c.reshape(c.shape[0],-1)).reshape(m, c.shape[1], -1)
    22.3 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Here res.shape is (100, 1, 64).
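
    As a sanity check (my addition), reshaping res back to einsum's fcde layout reproduces A's result:

    ref = np.einsum(a, [0, 5], b, [1, 5], c, [0, 1, 2, 3, 4], [5, 2, 3, 4])
    res = b.T[:, None, :] @ (a.T @ c.reshape(c.shape[0], -1)).reshape(m, c.shape[1], -1)
    assert np.allclose(ref, res.reshape(m, *c.shape[2:]))  # (100, 1, 32, 2)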

    I wonder if this is a more straightforward matmul dimension layout?

    'cde1ab,fb1->cdefa1; cdefa1,fa1->cdef11'
    
    In [96]: c1 = c.transpose((2,3,4,0,1))[...,None,:,:]
    In [97]: a1 = a.T[...,None]; b1 = b.T[...,None]
    In [98]: res3 = ((c1@b1)[...,None,:,0]@a1)[...,0,0]
    In [99]: res3.shape
    Out[99]: (1, 32, 2, 100)
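
    Moving `f` back to the front gives A's layout (a small check I added):

    # f is last in res3; transposing it to the front recovers (f, c, d, e)
    res3.transpose(3, 0, 1, 2).shape  # -> (100, 1, 32, 2)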
    

    Larger `m`:

    If I change `a` and `b` to m=300, I get a different einsum_path:

    In [106]: print(np.einsum_path(
         ...:     abig,
         ...:     [0, 5],
         ...:     bbig,
         ...:     [1, 5],
         ...:     c,
         ...:     [0, 1, 2, 3, 4],
         ...:     [5, 2, 3, 4],
         ...:     optimize="greedy",
         ...: )[1])
      Complete contraction:  af,bf,abcde->fcde
             Naive scaling:  6
         Optimized scaling:  6
          Naive FLOP count:  1.920e+08
      Optimized FLOP count:  1.920e+08
       Theoretical speedup:  1.000
      Largest intermediate:  1.920e+04 elements
    --------------------------------------------------------------------------
    scaling                  current                                remaining
    --------------------------------------------------------------------------
       6           abcde,bf,af->fcde                               fcde->fcde
    
    In [111]: %%timeit
         ...:         np.einsum(
         ...:             abig,
         ...:             [0, 5],
         ...:             bbig,
         ...:             [1, 5],
         ...:             c,
         ...:             [0, 1, 2, 3, 4],
         ...:             [5, 2, 3, 4],
         ...:             optimize=True,
         ...:         )
         ...: 
    464 ms ± 2.22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    The matmul equivalent only jumps 3x, i.e. linear in m:

    In [112]: a11 = abig.T[...,None]; b11 = bbig.T[...,None]
    In [113]: timeit ((c1@b11)[...,None,:,0]@a11)[...,0,0].shape
    74.9 ms ± 637 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [114]: a11.shape,b11.shape
    Out[114]: ((300, 101, 1), (300, 33, 1))
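
    So the cliff in A lines up with the path change: once m is large enough, greedy's heuristic stops splitting the contraction (theoretical speedup 1.000 above), and einsum runs the single six-index contraction. Spelling the contraction out as the two matmuls keeps the fast behavior for any m; a minimal sketch, assuming the question's array shapes:

    def contract_matmul(a, b, c):
        m = a.shape[1]
        # af,abcde -> (f, b, cde): one big GEMM over the a index
        tmp = (a.T @ c.reshape(c.shape[0], -1)).reshape(m, c.shape[1], -1)
        # bf,(f, b, cde) -> (f, 1, cde): batched matmul over the b index
        out = b.T[:, None, :] @ tmp
        return out.reshape(m, *c.shape[2:])  # einsum's fcde layout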