Tags: python-3.x, performance, numpy, reshape, blas

Why is numpy's kron so fast?


I was trying to implement a Kronecker product function. Below are three ideas that I have:

import numpy as np
import scipy.linalg.blas

def kron(arr1, arr2):
    """Column-wise outer products, avoiding relocating elements afterwards.
    """
    r1, c1 = arr1.shape
    r2, c2 = arr2.shape
    nrows, ncols = r1 * r2, c1 * c2
    res = np.empty((nrows, ncols))
    for idx1 in range(c1):
        for idx2 in range(c2):
            new_c = idx1 * c2 + idx2
            temp = np.zeros((r2, r1))
            temp_kron = scipy.linalg.blas.dger(
                alpha=1.0, x=arr2[:, idx2], y=arr1[:, idx1], incx=1, incy=1, 
                a=temp)
            res[:, new_c] = np.ravel(temp_kron, order='F')
    return res

def kron2(arr1, arr2):
    """First outer product, then rearrange items.
    """
    r1, c1 = arr1.shape
    r2, c2 = arr2.shape
    nrows, ncols = r1 * r2, c1 * c2
    tmp = np.outer(arr2, arr1)
    res = np.empty((nrows, ncols))
    for idx in range(arr1.size):
        for offset in range(c2):
            orig = tmp[offset::c2, idx]
            dest_coffset = idx % c1 * c2 + offset
            dest_roffset = (idx // c1) * r2
            res[dest_roffset:dest_roffset+r2, dest_coffset] = orig
    return res

def kron3(arr1, arr2):
    """First outer product, then rearrange items.
    """
    r1, c1 = arr1.shape
    r2, c2 = arr2.shape
    nrows, ncols = r1 * r2, c1 * c2
    tmp = np.outer(np.ravel(arr2, 'F'), np.ravel(arr1, 'F'))
    res = np.empty((nrows, ncols))
    for idx in range(arr1.size):
        for offset in range(c2):
            orig_offset = offset * r2
            orig = tmp[orig_offset:orig_offset+r2, idx]
            dest_c = idx // r1 * c2 + offset
            dest_r = idx % r1 * r2
            res[dest_r:dest_r+r2, dest_c] = orig
    return res

Based on this Stack Overflow post I created a MeasureTime decorator. A natural benchmark is to compare against numpy.kron. Below are my test functions:
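(The MeasureTime decorator itself is not shown here; a minimal sketch of what I mean, using the usual time.perf_counter pattern, would be something like the following. The name and output format are my own choices.)

```python
import time
from functools import wraps

def MeasureTime(func):
    """Print the wall-clock time a call takes, then return its result."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f'"{func.__name__}": {elapsed}s')
        return result
    return wrapper
```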

@MeasureTime
def test_np_kron(arr1, arr2, number=1000):
    for _ in range(number):
        np.kron(arr1, arr2)
    return

@MeasureTime
def test_kron(arr1, arr2, number=1000):
    for _ in range(number):
        kron(arr1, arr2)

@MeasureTime
def test_kron2(arr1, arr2, number=1000):
    for _ in range(number):
        kron2(arr2, arr1)

@MeasureTime
def test_kron3(arr1, arr2, number=1000):
    for _ in range(number):
        kron3(arr2, arr1)     

It turned out that NumPy's kron function performs much better:

arr1 = np.array([[1,-4,7], [-2, 3, 3]], dtype=np.float64, order='F')
arr2 = np.array([[8, -9, -6, 5], [1, -3, -4, 7], [2, 8, -8, -3], [1, 2, -5, -1]], dtype=np.float64, order='F')
In [243]: test_np_kron(arr1, arr2, number=10000)
Out [243]: "test_np_kron": 0.19688990000577178s
In [244]: test_kron(arr1, arr2, number=10000)
Out [244]: "test_kron": 0.6094115000014426s
In [245]: test_kron2(arr1, arr2, number=10000)
Out [245]: "test_kron2": 0.5699560000066413s
In [246]: test_kron3(arr1, arr2, number=10000)
Out [246]: "test_kron3": 0.7134822000080021s

I would like to know why that is the case. Is it because NumPy's reshape is much more performant than manually copying elements around (even though I am still doing the copying with NumPy)? I was puzzled, since I was using np.outer / blas.dger as well; the only difference I can see is in how the results are rearranged at the end. How does NumPy's reshape perform this well?

Here is the link to NumPy 1.17 kron source.

Update: I forgot to mention that I was prototyping in Python, planning to then implement kron in C++ with cblas/lapack. I had some existing 'kron' code that needed refactoring. I then came across NumPy's reshape and was really impressed.

Thanks in advance for your time!


Solution

  • Let's experiment with 2 small arrays:

    In [124]: A, B = np.array([[1,2],[3,4]]), np.array([[10,11],[12,13]])
    

    kron produces:

    In [125]: np.kron(A,B)                                                                         
    Out[125]: 
    array([[10, 11, 20, 22],
           [12, 13, 24, 26],
           [30, 33, 40, 44],
           [36, 39, 48, 52]])
    

    outer produces the same numbers, but with a different arrangement:

    In [126]: np.outer(A,B)                                                                        
    Out[126]: 
    array([[10, 11, 12, 13],
           [20, 22, 24, 26],
           [30, 33, 36, 39],
           [40, 44, 48, 52]])
    

    kron reshapes it to a combination of the shapes of A and B:

    In [127]: np.outer(A,B).reshape(2,2,2,2)                                                       
    Out[127]: 
    array([[[[10, 11],
             [12, 13]],
    
            [[20, 22],
             [24, 26]]],
    
    
           [[[30, 33],
             [36, 39]],
    
            [[40, 44],
             [48, 52]]]])
    

    It then recombines the 4 dimensions into 2 with concatenate:

    In [128]: np.concatenate(np.concatenate(_127, 1),1)                                            
    Out[128]: 
    array([[10, 11, 20, 22],
           [12, 13, 24, 26],
           [30, 33, 40, 44],
           [36, 39, 48, 52]])
    

    An alternative is to swap axes, and reshape:

    In [129]: _127.transpose(0,2,1,3).reshape(4,4)                                                 
    Out[129]: 
    array([[10, 11, 20, 22],
           [12, 13, 24, 26],
           [30, 33, 40, 44],
           [36, 39, 48, 52]])
    

    The first reshape and transpose produce a view, but the second reshape has to produce a copy. Concatenate makes a copy. But all those actions are done in compiled numpy code.
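    This view-versus-copy behavior can be checked directly with np.shares_memory (a small sketch of my own, not part of the original answer):

```python
import numpy as np

A, B = np.array([[1, 2], [3, 4]]), np.array([[10, 11], [12, 13]])

temp = np.outer(A, B)          # outer product: a new contiguous array
r = temp.reshape(2, 2, 2, 2)   # reshape of a contiguous array: a view
t = r.transpose(0, 2, 1, 3)    # transpose: always a view
final = t.reshape(4, 4)        # source no longer contiguous: forces a copy

print(np.shares_memory(temp, r))      # True  -> view
print(np.shares_memory(temp, t))      # True  -> view
print(np.shares_memory(temp, final))  # False -> copy
```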


    Defining functions:

    def foo1(A,B):
        temp = np.outer(A,B)
        temp = temp.reshape(A.shape + B.shape)
        return np.concatenate(np.concatenate(temp, 1), 1)
    
    def foo2(A,B):
        temp = np.outer(A,B)
        nz = temp.shape
        temp = temp.reshape(A.shape + B.shape)
        return temp.transpose(0,2,1,3).reshape(nz)
    

    testing:

    In [141]: np.allclose(np.kron(A,B), foo1(A,B))                                                 
    Out[141]: True
    In [142]: np.allclose(np.kron(A,B), foo2(A,B))                                                 
    Out[142]: True
    

    timing:

    In [143]: timeit np.kron(A,B)                                                                  
    42.4 µs ± 294 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [145]: timeit foo1(A,B)                                                                     
    26.3 µs ± 38.6 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    In [146]: timeit foo2(A,B)                                                                     
    13.8 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    My code may need some generalization, but it demonstrates the validity of the approach.

    ===

    With your kron:

    In [150]: kron(A,B)                                                                            
    Out[150]: 
    array([[10., 11., 20., 22.],
           [12., 13., 24., 26.],
           [30., 33., 40., 44.],
           [36., 39., 48., 52.]])
    In [151]: timeit kron(A,B)                                                                     
    55.3 µs ± 1.59 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    

    edit

    einsum can do both the outer and transpose:

    In [265]: np.einsum('ij,kl->ikjl',A,B).reshape(4,4)
    Out[265]: 
    array([[10, 11, 20, 22],
           [12, 13, 24, 26],
           [30, 33, 40, 44],
           [36, 39, 48, 52]])
    
    In [266]: timeit np.einsum('ij,kl->ikjl',A,B).reshape(4,4)
    9.87 µs ± 33 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
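    The einsum approach also generalizes directly to arbitrary 2-D shapes by computing the output dimensions from the inputs (the function name here is my own):

```python
import numpy as np

def kron_einsum(A, B):
    """Kronecker product of two 2-D arrays via einsum + reshape."""
    r1, c1 = A.shape
    r2, c2 = B.shape
    # Build the 4-D outer product with axes already in kron order
    # (i, k, j, l), then collapse to 2-D with a single reshape.
    return np.einsum('ij,kl->ikjl', A, B).reshape(r1 * r2, c1 * c2)
```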