I'm working on an optimization problem, but to avoid getting into the details, I'm going to provide a simple example of a bug that's been giving me headaches for a few days.
Say I have a 2D numpy array with observed x-y coordinates:
from scipy.optimize import distance
x = np.array([1,2], [2,3], [4,5], [5,6])
I also have a list of x-y coordinates to compare to these points (y):
y = np.array([11,13], [12, 14])
I have a function that takes the sum of manhattan differences between a value of x
and all of the values in y
:
def find_sum(ref_row, comp_rows):
modeled_counts = []
y = ref_row * len(comp_rows)
res = list(map(distance.cityblock, ref_row, comp_rows))
modeled_counts.append(sum(res))
return sum(modeled_counts)
Essentially, what I would like to do is find the sum of manhattan distances for every item in y
with each item in x
(so basically for each item in x, find the sum of the Manhattan distances between that (x,y) pair and every (x,y) pair in y
).
I've tried this out with the following line of code:
z = list(map(find_sum, x, y))
However, z
is of length 2 (like y
), and not 4 like x
. Is there a way to ensure that z
is the result of consecutive one-to-all calculations? That is, I'd like to calculate the sum of all of the manhattan differences between x[0]
and every set in y
, and so on and so forth, so the length of z
should be equal to the length of x
.
Is there a simple way to do this without a for
loop? My data is rather large (~ 4 million rows), so I'd really appreciate fast solutions. I'm fairly new to Python programming, so any explanations about why the solution works and is fast would be appreciated as well, but definitely isn't required!
Thanks!
Here is how you can do it using numpy broadcast with simplified explanation
Adjust Shape For Broadcasting
import numpy as np
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
## using np.newaxis as index add a new dimension at that position
## : give all the elements on that dimension
start_points = start_points[np.newaxis, :, :]
dest_points = dest_points[:, np.newaxis, :]
## Now lets check he shape of the point arrays
print('start_points.shape: ', start_points.shape) # (1, 4, 2)
print('dest_points.shape', dest_points.shape) # (2, 1, 2)
Lets try to understand
We can think start_points
and dest_points
as matrix or a table of points of size (1X4) and (2X1)
We clearly see that size are not compatible. What will happen if we perform arithmatic
operation between them? Here is where a smart part of numpy comes, called broadcast.
start_points
to match that of dest_point
making matrix of (2X4)dest_point
to match that of start_points
making matrix of (2X4)Calculate the distance
diff_x_y = start_points - dest_points
print(diff_x_y.shape) # (2, 4, 2)
abs_diff_x_y = np.abs(start_points - dest_points)
man_distance = np.sum(abs_diff_x_y, axis=2)
print('man_distance:\n', man_distance)
sum_distance = np.sum(man_distance, axis=0)
print('sum_distance:\n', sum_distance)
Oneliner
start_points = np.array([[1,2], [2,3], [4,5], [5,6]])
dest_points = np.array([[11,13], [12, 14]])
np.sum(np.abs(start_points[np.newaxis, :, :] - dest_points[:, np.newaxis, :]), axis=(0,2))
Here is more detail explanation of broadcasting if you want to understand it more