python, numpy, sparse-matrix, matrix-multiplication, elementwise-operations

numpy elementwise outer product with sparse matrices


I want to compute the element-wise outer product of three (or four) large 2D arrays in Python (the values are float32, rounded to 2 decimals). They all have the same number of rows "n", but different numbers of columns "i", "j", "k".
The resulting array should have shape (n, i*j*k). Then, I want to sum each column of the result to end up with a 1D array of shape (i*j*k).

np.shape(a) = (75466, 10)
np.shape(b) = (75466, 28)
np.shape(c) = (75466, 66)

np.shape(intermediate_result) = (75466, 18480)
np.shape(result) = (18480,)
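
For concreteness, dummy arrays with these shapes can be generated like this (random float32 data standing in for the real matrices):

import numpy as np

rng = np.random.default_rng(0)
a = rng.random((75466, 10), dtype=np.float32)
b = rng.random((75466, 28), dtype=np.float32)
c = rng.random((75466, 66), dtype=np.float32)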

Thanks to ruankesi and divakar, I got a piece of code that works:

# Multiply first two matrices
first_multi = a[...,None] * b[:,None]
# could use np.einsum('ij,ik->ijk',a,b), which is slightly faster
ab_fills = first_multi.reshape(a.shape[0], a.shape[1]*b.shape[1])

# Multiply the result with the third matrix
second_multi = ab_fills[..., None] * c[:,None]
abc_fills = second_multi.reshape(ab_fills.shape[0], ab_fills.shape[1] * c.shape[1])

# Get the result: sum each column to get a 1D array of length 10*28*66 = 18480
result = np.sum(abc_fills, axis = 0)
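
For the comparisons further down, these steps can be wrapped into a single function (a straightforward refactor of the code above; the function name is mine):

def elementwise_outer_sum(a, b, c):
    # element-wise outer product of a and b, flattened to (n, i*j)
    ab = (a[..., None] * b[:, None]).reshape(a.shape[0], -1)
    # multiply in c, flattened to (n, i*j*k)
    abc = (ab[..., None] * c[:, None]).reshape(ab.shape[0], -1)
    # sum over the rows -> 1D array of length i*j*k
    return abc.sum(axis=0)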

Problem 1: Performance

This takes about 3 seconds, but I have to repeat this operation many times and some of the matrices are even larger (in number of rows). It is acceptable but making it faster would be nice.

Problem 2: My matrices are sparse

Indeed, "a", for instance, contains about 70% zeros. I tried to play with scipy's csc_matrix, but I really could not get a working version (to get the element-wise outer product here, I go through a conversion to a 3D array, and 3D arrays are not supported by scipy's sparse matrices).
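
To illustrate where it breaks down: 2D sparse storage itself is easy, but the broadcasting step has no sparse counterpart, because scipy sparse matrices are strictly two-dimensional. A sketch, continuing from the dummy arrays above and zeroing out ~70% of "a":

from scipy import sparse

a[a < 0.7] = 0.0                       # make roughly 70% of the entries zero
a_sp = sparse.csc_matrix(a)            # 2D sparse storage works fine
print(a_sp.count_nonzero() / a.size)   # density is now ~0.30

# The dense code relies on a 3D broadcast, a[..., None] * b[:, None];
# scipy sparse matrices are 2D only, so this step has no direct sparse
# equivalent (a_sp[..., None] is not supported).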

Problem 3: memory usage

If I try to also work with a 4th matrix, I run into memory issues.
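
A quick back-of-the-envelope estimate shows why (assuming float32, i.e. 4 bytes per value; the width of the hypothetical 4th matrix is just an example):

n, i, j, k = 75466, 10, 28, 66
m = 50                                        # hypothetical width of a 4th matrix

print(n * i * j * k * 4 / 1e9)                # abc_fills alone: ~5.6 GB
print(n * i * j * k * m * 4 / 1e9)            # 4-way intermediate: ~280 GB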


I imagine that converting this code to sparse matrices would save a lot of memory and make the calculation faster by ignoring the numerous zero values. Is that true? If so, can someone help me?
Of course, if you have any suggestion for a better implementation, I am also very interested. I don't need any of the intermediate results, just the final 1D result.
I have been stuck on this part of the code for weeks, and I am going nuts!

Thank you!



Edit after Divakar's answer

Approach #1:
A very nice one-liner, but surprisingly slower than the original approach (?).
On my test dataset, approach #1 takes 4.98 s ± 3.06 ms per loop (no speedup with optimize=True)
The original decomposed approach took 3.01 s ± 16.5 ms per loop


Approach #2:
Absolutely great, thank you! What an impressive speedup!
62.6 ms ± 233 µs per loop


Regarding numexpr, I try to avoid requiring external modules as much as possible, and I don't plan to use multiple cores/threads. This is an "embarrassingly parallel" task: with hundreds of thousands of objects to analyze, I'll just spread the list across the available CPUs in production. I will give numexpr a try for memory optimization, though.
As a brief test of numexpr restricted to 1 thread, performing one multiplication, I get a runtime of 40 ms without numexpr and 52 ms with numexpr.
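
For reference, a sketch of how that single-thread comparison can be set up (numexpr exposes set_num_threads for this):

import numexpr as ne

ne.set_num_threads(1)                  # restrict numexpr to a single thread
first_multi = ne.evaluate('A*B', {'A': a[..., None], 'B': b[:, None]})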
Thanks again!!


Solution

  • Approach #1

    We can use np.einsum to do sum-reductions in one go -

    result = np.einsum('ij,ik,il->jkl',a,b,c).ravel()
    

    Also, play around with the optimize flag in np.einsum by setting it to True to use BLAS.
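
    For example, the same call with the flag enabled:

    result = np.einsum('ij,ik,il->jkl', a, b, c, optimize=True).ravel()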

    Approach #2

    We can use broadcasting to do the first step, as also done in the posted code, and then leverage tensor matrix multiplication with np.tensordot -

    def broadcast_dot(a,b,c):
        first_multi = a[...,None] * b[:,None]
        return np.tensordot(first_multi,c, axes=(0,0)).ravel()
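
    As a sanity check (not part of the original answer), the two approaches can be cross-checked on the same data; a loose tolerance accounts for accumulation in different summation orders, especially with float32 data:

    out_bcast = broadcast_dot(a, b, c)
    out_einsum = np.einsum('ij,ik,il->jkl', a, b, c).ravel()
    assert np.allclose(out_bcast, out_einsum, rtol=1e-3)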
    

    We can also use the numexpr module, which supports multi-core processing and achieves better memory efficiency, to get first_multi. This gives us a modified solution, like so -

    import numexpr as ne
    
    def numexpr_broadcast_dot(a,b,c):
        first_multi = ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
        return np.tensordot(first_multi,c, axes=(0,0)).ravel()
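
    Usage is identical; with numexpr installed, the result should match the pure-NumPy version (again within a loose tolerance):

    out_ne = numexpr_broadcast_dot(a, b, c)
    assert np.allclose(out_ne, broadcast_dot(a, b, c), rtol=1e-3)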
    

    Timings on random float data with given dataset sizes -

    In [36]: %timeit np.einsum('ij,ik,il->jkl',a,b,c).ravel()
    4.57 s ± 75.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [3]: %timeit broadcast_dot(a,b,c)
    270 ms ± 103 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [4]: %timeit numexpr_broadcast_dot(a,b,c)
    172 ms ± 63.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    Just to give a sense of improvement with numexpr -

    In [7]: %timeit a[...,None] * b[:,None]
    80.4 ms ± 2.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [8]: %timeit ne.evaluate('A*B',{'A':a[...,None],'B':b[:,None]})
    25.9 ms ± 191 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
    

    This should be substantial when extending this solution to a higher number of inputs.
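
    As an illustration of how the same idea might extend to four inputs (my own sketch, not from the original answer): broadcast the inputs pairwise, flatten each pair, and let a single matrix product perform the reduction over the rows, so the huge (n, i*j*k*l) intermediate is never materialized.

    def broadcast_dot_4(a, b, c, d):
        # pairwise element-wise outer products, flattened to (n, i*j) and (n, k*l)
        ab = (a[..., None] * b[:, None]).reshape(a.shape[0], -1)
        cd = (c[..., None] * d[:, None]).reshape(c.shape[0], -1)
        # reduce over the n rows with one matrix product -> (i*j, k*l), then flatten
        return (ab.T @ cd).ravel()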