Fastest way to calculate least squares between multiple points in Python with Numpy

I have a problem which can best be described by the below analogy...

I have an "ocean" in XY space consisting of evenly spaced nodes which can be used to describe a given position on the ocean, for example:

#in my actual use case these arrays have >100,000 points.
node_x_values = np.array(0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4)  
node_y_values = np.array(0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4)

This shows that the first node is at 0,0 and the last node is at x,y = 4,4.

I also have an array showing which which nodes have "boats" on them, e.g:

boat_list = np.array([3,7,18])  #there are 3 boats in total, one on node 3, 7 and 18. In my actual use case there are 200+ "boats"

To get the x and y position of the boat on node 7 it is simply a matter of:

boat_x = node_x_values[7]
boat_y = node_y_values[7]

What I am looking for is the fastest way to determine the node with the minimum distance to a boat for all of the boats at once.

For one boat this is easy using least squares, e.g. boat on node 7:

boat_distances = (((node_x_values-boat_x)**2)+((node_y-values-boat_y)**2))**0.5
node_with_min_distance  = np.argmin(boat_distances)

Is there a fast way to do it for all the boats in my list other than something like the below:

temp_accumulator = np.full(9999, node_x_values.size, dtype=np.float32)
for boat in boat_list:
   boat_x = node_x_values[boat]
   boat_y = node_y_values[boat]
   boat_distances = (((node_x_values-boat_x)**2)+((node_y_values-boat_y)**2))**0.5
   temp_accumulator_array = np.minimum(temp_accumulator_array, boat_distances)

node_with_least_distance_to_boat = np.argmin(temp_accumulator)

Is there a faster way to do this without looping over each boat? I have 200+ "boats" and 1000,000 node positions so this runs very slowly even with numba and multithreading (my actual use and code are quite different to the above but this was the best example I could think of to explain the problem!

Solution

If I understand the problem correctly it screams for using a KDtree.

It is a data-structure that is loaded up with fixed points and then can be used to query the closest points to a set of new positions.

So in your case you load the KDtree with your nodes and then query your boat positions.

There are a few implementations, like pykdtree, others are part of scipy and sklearn.

# python -m pip install pykdtree

import numpy as np
from pykdtree.kdtree import KDTree

node_x_values = np.array([0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4,0,1,2,3,4])  
node_y_values = np.array([0,0,0,0,0,1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4])

node_pts= np.r_[node_x_values, node_y_values]

kd_tree = KDTree(node_pts)

boat_list = np.array([3,7,18])
boat_pts = node_pts[boat_list]

# k is the number of neightbors, here 2 = only two closest
dist, idx = kd_tree.query(boat_pts , k=2)

# the closest is always the node the ship is at, so that's not useful
indices = idx[:,1]
distances = dist[:,1]

print('closest node for each boat', indices )
print('closest points for each boat', node_pts[indices ])
print('distance to next node for each boat', distances)

For your problem with N nodes and M boats. The complexity of this search is O(M log(N)). So it should run very fast with your problem size of N = 10**6, M = 200.'

And if you want the node with the least distance to any boat, you just check id = indices[np.argmin(distances)]. This gives you the node ID and node_pts[id] the position.