Tags: python, numpy, loops, vectorization

Element-by-element or matrix execution


I am taking a deep learning course (http://ufldl.stanford.edu/housenumbers/) in which one assignment is to implement the KNN (k-nearest-neighbors) method.

The assignment involves building a distance matrix D in which each element is the distance between an observation from the training dataset and an observation from the test dataset: D[i][j] is the Manhattan distance (L1 norm) between the i-th observation of the training dataset and the j-th observation of the test dataset.
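
For concreteness, the L1 distance between two feature vectors is just the sum of the absolute per-feature differences. A quick toy example (my own, not from the assignment):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# |1 - 4| + |2 - 0| + |3 - 3| = 3 + 2 + 0 = 5
print(np.sum(np.abs(a - b)))  # 5.0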

There are three approaches to accomplishing this:

The first is (with two loops):

import numpy as np

def compute_distances_two_loops(train_X, test_X):
    '''
    Computes L1 distance from every test sample to every training sample
    Uses simplest implementation with 2 Python loops

    Arguments:
    train_X, np array (num_train_samples, num_features)
    test_X,  np array (num_test_samples, num_features)

    Returns:
    dists, np array (num_test_samples, num_train_samples) - array
       with distances between each test and each train sample
    '''
    num_train = train_X.shape[0]
    num_test = test_X.shape[0]
    dists = np.zeros((num_test, num_train), np.float32)
    for i_test in range(num_test):
        for i_train in range(num_train):
            # L1 distance: sum of absolute per-feature differences
            dists[i_test, i_train] = np.sum(np.abs(test_X[i_test] - train_X[i_train]))
    return dists

The second is (with one loop):

def compute_distances_one_loop(train_X, test_X):
    '''
    Computes L1 distance from every test sample to every training sample
    Uses a single Python loop over the test samples

    Arguments:
    train_X, np array (num_train_samples, num_features)
    test_X,  np array (num_test_samples, num_features)

    Returns:
    dists, np array (num_test_samples, num_train_samples) - array
       with distances between each test and each train sample
    '''
    num_train = train_X.shape[0]
    num_test = test_X.shape[0]
    dists = np.zeros((num_test, num_train), np.float32)
    for i_test in range(num_test):
        # Stack the current test sample num_train times, then take
        # absolute per-feature differences against the whole train set
        diff_matrix = np.abs(train_X - np.tile(test_X[i_test], (num_train, 1)))
        # Multiplying by a column of ones sums each row of diff_matrix
        to_sum_matrix = np.ones((train_X.shape[1], 1), np.float32)
        dists[i_test] = (diff_matrix @ to_sum_matrix).T
    return dists

And the third is (with no loops):

def compute_distances_no_loops(train_X, test_X):
    '''
    Computes L1 distance from every test sample to every training sample
    Fully vectorizes the calculations using numpy

    Arguments:
    train_X, np array (num_train_samples, num_features)
    test_X,  np array (num_test_samples, num_features)

    Returns:
    dists, np array (num_test_samples, num_train_samples) - array
       with distances between each test and each train sample
    '''
    num_train = train_X.shape[0]
    num_test = test_X.shape[0]
    # Build two (num_test * num_train, num_features) matrices so that
    # every test/train pair appears in matching rows
    large_row_matrix_test = np.repeat(test_X, num_train, axis=0)
    large_row_matrix_train = np.tile(train_X, (num_test, 1))
    diff_matrix = np.abs(large_row_matrix_test - large_row_matrix_train)
    # Multiplying by a column of ones sums each row of diff_matrix
    to_sum_matrix = np.ones((train_X.shape[1], 1), np.float32)
    dists = np.reshape(diff_matrix @ to_sum_matrix, (num_test, num_train))
    return dists

Next, the task requires me to measure the execution time of each method:

# Let's look at the performance difference
%timeit knn_classifier.compute_distances_two_loops(binary_test_X)
%timeit knn_classifier.compute_distances_one_loop(binary_test_X)
%timeit knn_classifier.compute_distances_no_loops(binary_test_X)

I was surprised by the results, as they turned out to be:

two loops: 614 ms ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

one loop:  836 ms ± 35.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

no loops:  961 ms ± 30.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's intriguing that the simplest method turned out to be the fastest, even though it uses the Python loops that proficient programmers usually advise avoiding. This has led me to wonder whether I made an error or misunderstood a fundamental concept. Alternatively, perhaps THIS is a scenario where USING LOOPS is, in fact, advantageous.

Can someone bring some clarity to this case?


Solution

  • It's broadly true that it's better to avoid Python loops when using numpy. However, this should be seen as part of a more general principle: use as much of numpy's built-in functionality as you can, since it runs faster than native Python code by taking advantage of compiled C loops and the CPU's SIMD (single instruction, multiple data) instructions.
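
    As a quick illustration of that gap (a toy example of mine, not taken from the question), compare summing a large array with an interpreted Python loop against numpy's built-in np.sum:

    import numpy as np

    x = np.random.uniform(size=1_000_000)

    def python_sum(arr):
        total = 0.0
        for v in arr:  # one interpreted iteration per array element
            total += v
        return total

    # Both compute the same value, but np.sum runs its loop in compiled C;
    # on typical hardware it is orders of magnitude faster here
    print(np.isclose(python_sum(x), np.sum(x)))  # True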

    Here, your "no loops" function indeed has no Python loops, but it spends so much time copying data in np.repeat and np.tile that the cost outweighs the time saved by avoiding the loops; the sketch below gives a feel for the scale of the copying. To avoid it while still avoiding loops, you can use broadcasting.
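
    With the array sizes used in the random-input example further down (500 test samples, 5000 training samples, 50 features), each replicated matrix is a fresh 1 GB allocation:

    import numpy as np

    test_X = np.zeros((500, 50))    # (num_test_samples, num_features)
    train_X = np.zeros((5000, 50))  # (num_train_samples, num_features)

    big_test = np.repeat(test_X, 5000, axis=0)  # shape (2500000, 50)
    big_train = np.tile(train_X, (500, 1))      # shape (2500000, 50)

    # 2500000 rows * 50 columns * 8 bytes per float64 = 1 GB per copy,
    # and the subtraction then allocates a third array of the same size
    print(big_test.nbytes / 1e9, big_train.nbytes / 1e9)  # 1.0 1.0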

    The trick is to treat your arrays as 3D. In the code below, train_bc has the training samples along axis 1 and the features along axis 2, while test_bc has the testing samples along axis 0 and the features along axis 2. We can then subtract those arrays so that distance_components[i][j][k] gives the distance component for feature k between testing sample i and training sample j. Finally, summing over axis 2 gives the distance matrix.

    (Note, this gives the same output as your compute_distances_no_loops, but I think it's actually the transpose of what your exercise requires, based on the definition of D[i][j] in your question.)

    import numpy as np

    num_train_samples = 5000
    num_test_samples = 500
    num_features = 50

    train_X = np.random.uniform(size=(num_train_samples, num_features))
    test_X = np.random.uniform(size=(num_test_samples, num_features))
    
    def compute_distances_broadcast(train_arr, test_arr):
        # Add additional axes so the calculation can be done with broadcasting
        train_bc = train_arr[np.newaxis, :, :]  # Axes are (_, sample, feature)
        test_bc = test_arr[:, np.newaxis, :]    # Axes are (sample, _, feature)
        distance_components = np.abs(train_bc - test_bc)   # Axes are (test_sample, train_sample, feature)
        distances = np.sum(distance_components, axis=2)   # Axes are (test_sample, train_sample)
        return distances  
    
    bc = compute_distances_broadcast(train_X, test_X)
    nl = compute_distances_no_loops(train_X, test_X)
    print(np.allclose(bc, nl))  # prints True, i.e. the two functions give the same result
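
    Note that broadcasting does not eliminate the big intermediate entirely: np.abs(train_bc - test_bc) still materializes the full (num_test, num_train, num_features) array. The saving comes from skipping the two explicit row-replicated input copies.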
    

    Timing this for the random input I created (which obviously isn't what you performed your timings on):

    compute_distances_broadcast(train_X, test_X): 818 ms ± 33.8 ms per loop

    compute_distances_no_loops(train_X, test_X): 1.68 s ± 516 ms per loop

    So the solution without repeat or tile is around twice as fast as the original no_loops.
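
    As one further point of comparison (not part of the original exercise, which presumably wants you to write the vectorization yourself), SciPy has a ready-made routine for pairwise distance matrices, implemented in C and without the large 3D intermediate:

    from scipy.spatial.distance import cdist

    # Manhattan / L1 distance is called 'cityblock' in scipy;
    # the result has shape (num_test_samples, num_train_samples)
    dists = cdist(test_X, train_X, metric='cityblock')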