Tags: matlab, matrix, gradient-descent

Good rules of thumb for when to use matrix multiplication, sum(), or a for loop in MATLAB?


I'm trying to develop general heuristics for translating equations to code. This particular question addresses how to implement an equation containing a summation in MATLAB.

An example of using sum() vs. matrix multiplication:

I implemented this equation, and thought I needed to use a sum() function:

Equation 1: J(theta) = 1/(2m) * sum_{i=1}^{m} (h_theta(x^(i)) - y^(i))^2

J = 1/(2*m) * sum( (X*theta - y).^2 );

Then I implemented this similar equation, without needing to use a sum() function!

Equation 2: theta_j := theta_j - (alpha/m) * sum_{i=1}^{m} (h_theta(x^(i)) - y^(i)) * x_j^(i)

theta = theta - (alpha/m) * ((X*theta - y)'*X)';

Where:

X: 100x2 (training input plus a 'ones' vector)
y: 100x1 (training output) 
theta:  2x1 (parameters)
m: 100 (length of y)
alpha: 0.01 (learning rate) 

What are the principles for when Matlab's matrix multiplication "handles" the summation?

Thanks!


Solution

  • Always use matrix multiplication for anything dealing with matrices or vectors in a linear algebra context. Specifically, if you can express whatever you need to compute in terms of linear algebra (a combination of addition, subtraction, multiplication, etc. on matrices), then do it. MATLAB was created to perform linear algebra operations as fast as possible, and using sum will usually be slower. For example, look at this post: fast matrix multiplication in Matlab

    This post also provides insight: Matlab matrix multiplication speed. MATLAB also runs matrix multiplication multi-threaded, and it is heavily optimized for multi-core CPUs.
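
    To see why matrix multiplication "handles" the summation, note that every entry of a matrix-vector product is itself a sum: (X*theta)(i) = sum over j of X(i,j)*theta(j). As a minimal sketch (illustrative only, with small random data), the following loop computes the same result as X*theta:

    % Illustrative sketch: an explicit-loop version of p = X*theta.
    % Each entry p(i) is a dot product, i.e. a summation over j.
    X = rand(100, 2);
    theta = rand(2, 1);
    p = zeros(100, 1);
    for i = 1:size(X, 1)
        for j = 1:size(X, 2)
            p(i) = p(i) + X(i,j) * theta(j);
        end
    end
    max(abs(p - X*theta))   % agrees up to floating-point round-off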


    If you want a test, let's tackle the easier case (equation 1), where you can use either sum or matrix multiplication to calculate the quantity. You can also compute J with matrix multiplication:

    d = X*theta - y;
    J = 1/(2*m)*(d.'*d);
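
    As a quick sanity check (illustrative only, on a small random vector), the two forms agree up to floating-point round-off:

    % Sanity check: sum of squares equals the dot product of d with itself.
    rng(0);
    d = rand(100, 1);
    s1 = sum(d.^2);
    s2 = d.'*d;
    abs(s1 - s2)   % tiny, on the order of eps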
    

    This matrix form uses the definition of the dot product: the dot product of the residual vector d = X*theta - y with itself is exactly the sum of its squared entries, so treating X*theta - y as an m x 1 matrix lets the multiplication absorb the summation. What you are computing here is the cost function for linear regression, which is minimized via gradient descent.

    For a benchmark, let's create a rather large problem: a 100 x 1 parameter vector theta and a 10000000 x 100 data matrix X, i.e. 10 million data points by 100 parameters. Note that X alone occupies about 8 GB of memory, so you may not be able to run this test on your machine. I'll initialize everything with random numbers and set a seed to ensure reproducibility, then use timeit to see how long each version takes. This is the test function I wrote:

    function test_grad
    % Benchmark: sum-based vs. matrix-multiplication-based cost computation

    rng(123);             % fix the seed for reproducibility
    theta = rand(100,1);  % 100 x 1 parameter vector
    X = rand(1e7, 100);   % 10 million data points by 100 parameters
    y = rand(1e7, 1);     % 10 million training outputs
    m = size(X, 1);

        % Version 1: explicit summation with sum
        function test1
        out = 1/(2*m) * sum( (X*theta - y).^2 );
        end

        % Version 2: summation absorbed into a dot product
        function test2
        d = X*theta - y;
        out = 1/(2*m)*(d.'*d);
        end

    t1 = timeit(@test1);
    t2 = timeit(@test2);
    fprintf('The timing for sum: %f seconds\n', t1);
    fprintf('The timing for matrix multiplication: %f seconds\n', t2);
    end
    

    When you run this function in MATLAB, timeit calls each version many times and reports the median running time, giving a fair comparison between sum and matrix multiplication.

    This is what I get when I run this function on a MacBook Pro with 16 GB of RAM and a 2.3 GHz Intel Core i7 CPU:

    >> test_grad
    The timing for sum: 0.594337 seconds
    The timing for matrix multiplication: 0.393643 seconds
    

    As you can see, matrix multiplication (at least on my machine) is about 0.2 seconds faster per call, as measured by timeit.
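
    For completeness, the same principle applies to the second equation from the question: the product ((X*theta - y)'*X)' sums over all m training examples internally. A loop version of that update (a sketch only, assuming the same X, y, theta, alpha, and m as defined in the question) makes the hidden summation explicit:

    % Sketch: explicit-loop form of
    %   theta = theta - (alpha/m) * ((X*theta - y)'*X)';
    g = zeros(size(theta));            % gradient accumulator
    for j = 1:numel(theta)
        for i = 1:m
            g(j) = g(j) + (X(i,:)*theta - y(i)) * X(i,j);
        end
    end
    theta = theta - (alpha/m) * g;     % same update, summation written out

    The vectorized form is equivalent but far faster, for the same reasons shown in the benchmark above.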


    tl;dr: If you can use matrix multiplication, do it. It's the fastest you'll ever be able to get your code running.