algorithm apache-spark geolocation haversine bigdata

Efficiently find closest point from another point

I have a list A of coordinates (latitute, longitute in decimal form) with ~10.000 points and a second list B of coordinates of the same type with ~1 million points.

I want to find the closest point in list A for each element in list B.

What I have already done is create the cartesian product of the two lists and find the distance of all the combinations using the haversine formula.

Then I get the points in list A which have the minimum distance for each point in list B.

Since the total combinations are more than 10 billion, the time taken to calculate the distances is too long.

Is there a way to ensure that every point in list B will match a point in list A but also improve the performance?

Solution

If you have already created the cross product and worked out all the haversine distances, then you have already done the majority of the work, so I will assume the question is about what to do if you have new sets A and B.

To repeatedly find the closest point in A I would build some sort of tree structure containing the points in A, and store information at each node of the tree which amounts to a bounding box or equivalent enclosing all of its descendants. Then when trying to find the closest point in A, you recursively search the tree containing A, returning from recursive calls when you have reached a node and you can work out from the information stored there that all of its descendants are further from the target point than the closest match so far.

For this code to work, the bounding box information needs to be accurate, but if the tree is stupid it will slow searches down but not stop them from finding the correct answer. This means, in particular, that when you build the tree you can safely ignore the inconvenient habit of longitude of wrapping round at 180W = 180E. You could pretend that lat-long is a rectangular grid and build a k-d tree, you could combine latitude and longitude and bit-interleave them and build a 1-dimensional search tree on the result, you could compute the https://en.wikipedia.org/wiki/Geohash and build a search tree based on this, or you could calculate lots of haversines and build a https://en.wikipedia.org/wiki/Cover_tree - all of these should work and I have no idea which will be best - it might depend on your data and the libraries you have available.