Search code examples
pythonarraysnumpyarray-broadcasting

Fast way to do consecutive one-to-all calculations on Numpy arrays without a for-loop?


I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.

Say I have a 2D numpy array with observed x-y coordinates:

from scipy.optimize import distance
x = np.array([1,2], [2,3], [4,5], [5,6])

I also have a list of x-y coordinates to compare to these points (y):

y = np.array([11,13], [12, 14])

I have a function that takes the sum of manhattan differences between a value of x and all of the values in y:

def find_sum(ref_row, comp_rows):
    modeled_counts = []
    y = ref_row * len(comp_rows)
    res = list(map(distance.cityblock, ref_row, comp_rows))
    modeled_counts.append(sum(res))

    return sum(modeled_counts)

Essentially, what I would like to do is find the sum of manhattan distances for every item in y with each item in x (so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y).

I've tried this out with the following line of code:

z = list(map(find_sum, x, y))

However, z is of length 2 (like y), and not 4 like x. Is there a way to ensure that z is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the manhattan differences between x[0] and every set in y, and so on and so forth, so the length of z should be equal to the length of x.

Is there a simple way to do this without a for loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!

Thanks!


Solution

  • Here is how you can do it using numpy broadcast with simplified explanation

    Adjust Shape For Broadcasting

    import numpy as np
    
    start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
    dest_points = np.array([[11,13], [12, 14]])
    
    ## using np.newaxis as index add a new dimension at that position
    ## : give all the elements on that dimension
    start_points = start_points[np.newaxis, :, :]
    dest_points = dest_points[:, np.newaxis, :]
    
    ## Now lets check he shape of the point arrays
    print('start_points.shape: ', start_points.shape) # (1, 4, 2)
    print('dest_points.shape', dest_points.shape) # (2, 1, 2)
    

    Lets try to understand

    • last element of shape represent x and y of a point, size 2
    • we can think of start_points as having 1 row and 4 columns of points
    • we can think of dest_points as having 2 rows and 1 columns of points

    We can think start_points and dest_points as matrix or a table of points of size (1X4) and (2X1) We clearly see that size are not compatible. What will happen if we perform arithmatic operation between them? Here is where a smart part of numpy comes, called broadcast.

    • It will repeat rows of start_points to match that of dest_point making matrix of (2X4)
    • It will repeat columns of dest_point to match that of start_points making matrix of (2X4)
    • Result is arithmetic operation between every pair of elements on start_points and dest_points

    Calculate the distance

    diff_x_y = start_points - dest_points
    print(diff_x_y.shape) # (2, 4, 2)
    abs_diff_x_y = np.abs(start_points - dest_points)
    man_distance = np.sum(abs_diff_x_y, axis=2)
    print('man_distance:\n', man_distance)
    sum_distance = np.sum(man_distance, axis=0)
    print('sum_distance:\n', sum_distance)
    

    Oneliner

    start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
    dest_points = np.array([[11,13], [12, 14]])
    np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
    
    

    Here is more detail explanation of broadcasting if you want to understand it more