Tags: python, arrays, matlab, vectorization, bsxfun

Is MATLAB's bsxfun the best approach, or is Python's numpy.einsum faster?


I have a very large multiply and sum operation that I need to implement as efficiently as possible. The best method I've found so far is bsxfun in MATLAB, where I formulate the problem as:

L = 10000;
x = rand(4,1,L+1);     % one 4x1 vector per step, kept with a singleton 2nd dimension
A_k = rand(4,4,L);     % one 4x4 coefficient matrix per step
tic
for k = 2:L
    i = 2:k;
    % broadcast each x slice across the columns of the paired A_k slice,
    % then collapse the column (2nd) and slice (3rd) dimensions
    x(:,1,k+1) = x(:,1,k+1) + sum(sum(bsxfun(@times,A_k(:,:,2:k),x(:,1,k+1-i)),2),3);
end
toc

Note that L will be larger in practice. Is there a faster method? It's strange that I need to first add the singleton dimension to x and then sum over it, but I can't get it to work otherwise.
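
In other words, each pass of the loop amounts to the element-wise update

$$x_{k+1} \;\gets\; x_{k+1} + \sum_{i=2}^{k}\Big(\sum_{c=1}^{4} A_i(:,c)\Big)\odot x_{k+1-i},$$

where $\odot$ denotes element-wise multiplication; the inner sum over the columns of each A_i is what the singleton dimension and the sum(...,2) are there for.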

It's still much faster than any other method I've tried, but not enough for our application. I've heard rumors that the Python function numpy.einsum may be more efficient, but I wanted to ask here first before I consider porting my code.

I'm using MATLAB R2017b.


Solution

  • I believe both of your summations can be removed, but I only removed the easier one for the time being. The summation over the second dimension is trivial, since it only affects the A_k array:

    B_k = sum(A_k,2);    % collapse the column dimension once, up front
    for k = 2:L
        i = 2:k;
        x(:,1,k+1) = x(:,1,k+1) + sum(bsxfun(@times,B_k(:,1,2:k),x(:,1,k+1-i)),3);
    end
    

    With this single change the runtime is reduced from ~8 seconds to ~2.5 seconds on my laptop.
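
    If you want to convince yourself that the two loops are equivalent, a quick sanity check along these lines (an illustrative sketch run with a smaller L, not part of the timing) should report a difference at machine-precision level:

    L = 1000;                       % smaller size, just for the check
    x0  = rand(4,1,L+1);
    A_k = rand(4,4,L);
    B_k = sum(A_k,2);
    
    xa = x0;                        % original bsxfun formulation
    for k = 2:L
        i = 2:k;
        xa(:,1,k+1) = xa(:,1,k+1) + sum(sum(bsxfun(@times,A_k(:,:,2:k),xa(:,1,k+1-i)),2),3);
    end
    
    xb = x0;                        % inner sum precomputed into B_k
    for k = 2:L
        i = 2:k;
        xb(:,1,k+1) = xb(:,1,k+1) + sum(bsxfun(@times,B_k(:,1,2:k),xb(:,1,k+1-i)),3);
    end
    
    max(abs(xa(:) - xb(:)))         % expect something on the order of eps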

    The second summation could also be removed by transforming the times+sum into a matrix-vector product. It needs some singleton fiddling to get the dimensions right, but if you define an auxiliary array C_k that is B_k with the second dimension reversed, you can compute the remaining sum as roughly x*C_k, give or take a few calls to reshape.


    So after a closer look I realized that my original assessment was overly optimistic: in the remaining term both factors depend on both the row index and the slice index, so it's not a simple matrix-vector product. Anyway, we can rewrite that term as the diagonal of a matrix product. This means we compute a bunch of unnecessary matrix elements, but it still seems to be slightly faster than the bsxfun approach, and we can get rid of your pesky singleton dimension too:

    L = 10000;
    x = rand(4,L+1);
    A_k = rand(4,4,L);
    B_k = squeeze(sum(A_k,2)).';    % B_k(i,:) holds sum(A_k(:,:,i),2), laid out as a row
    
    tic
    for k = 2:L
        ii = 1:k-1;
        x(:,k+1) = x(:,k+1) + diag(x(:,ii)*B_k(k+1-ii,:));
    end
    toc
    

    This runs in ~2.2 seconds on my laptop, somewhat faster than the ~2.5 seconds obtained previously.
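
    To spell out why the diag trick recovers the right quantity: with X = x(:,ii) and B = B_k(k+1-ii,:) as in the loop above, the r-th diagonal entry of X*B is

    $$[XB]_{rr} = \sum_{m=1}^{k-1} x(r,m)\,B_k(k+1-m,\,r) = \sum_{i=2}^{k}\Big(\sum_{c=1}^{4} A_k(r,c,i)\Big)\,x(r,\,k+1-i),$$

    which is exactly the per-element sum from the original loop; the off-diagonal entries of X*B are the unnecessary elements mentioned above.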