Search code examples
pythonperformancenumpymatrix-multiplicationscalar

Most efficient way to multiply a small matrix with a scalar in numpy


I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000:

large_dimension = 1000

a = np.random.random((1,))
b = np.random.random((1, large_dimension))

c = np.matmul(a, b)

In other words, multiplying matrix b with the scalar a[0].

I am looking for the most efficient way to compute this, since this operation is repeated millions of times.

I tested for performance of the two trivial ways to do this, and they are practically equivalent:

%timeit np.matmul(a, b)
>> 1.55 µs ± 45.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

%timeit a[0] * b
>> 1.77 µs ± 34.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Is there a more efficient way to compute this?

  • Note: I cannot move these computations to a GPU since the program is using multiprocessing and many such computations are done in parallel.

Solution

  • large_dimension = 1000
    
    a = np.random.random((1,))
    B = np.random.random((1, large_dimension))
    
    %timeit np.matmul(a, B)
    5.43 µs ± 22 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    
    %timeit a[0] * B
    5.11 µs ± 6.92 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    Use just float

    %timeit float(a[0]) * B
    3.48 µs ± 26.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    To avoid memory allocation use "buffer"

    buffer = np.empty_like(B)
    
    %timeit np.multiply(float(a[0]), B, buffer)
    2.96 µs ± 37.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    To avoid unnecessary getting attribute use "alias"

    mul = np.multiply
    
    %timeit mul(float(a[0]), B, buffer)
    2.73 µs ± 12.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
    

    And I don't recommend using numpy scalars at all, because if you avoid it, computation will be faster

    a_float = float(a[0])
    
    %timeit mul(a_float, B, buffer)
    1.94 µs ± 5.74 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
    

    Furthermore, if it's possible then initialize buffer out of loop once (of course, if you have something like loop :)

    rng = range(1000)
    
    %%timeit
    for i in rng:
        pass
    24.4 µs ± 1.21 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
    
    %%timeit
    for i in rng:
        mul(a_float, B, buffer)
    1.91 ms ± 2.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    

    So,

    "best_iteration_time" = (1.91 - 0.02) / 1000 => 1.89 (µs)

    "speedup" = 5.43 / 1.89 = 2.87