Search code examples
pythonscipyspatialkdtreemap-projections

Python nearest neighbour - coordinates


I wanted to check I was using scipy's KD tree correctly because it appears slower than a simple bruteforce.

I had three questions regarding this:

Q1.

If I create the following test data:

nplen = 1000000
# WGS84 lat/long
point = [51.349,-0.19]
# This contains WGS84 lat/long
points = np.ndarray.tolist(np.column_stack(
        [np.round(np.random.randn(nplen)+51,5),
         np.round(np.random.randn(nplen),5)]))

And create three functions:

def kd_test(points,point):
    """ KD Tree"""
    return points[spatial.KDTree(points).query(point)[1]]

def ckd_test(points,point):
    """ C implementation of KD Tree"""
    return points[spatial.cKDTree(points).query(point)[1]]

def closest_math(points,point):
    """ Simple angle"""
    return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points))[1:3]   

I would expect the cKD tree to be the fastest, however - running this:

print("Co-ordinate: ", f(points,point))
print("Index: ", points.index(list(f(points,point))))
%timeit f(points,point)

Result times - the simple bruteforce method is faster:

closest_math: 1 loops, best of 3: 3.59 s per loop
ckd_test: 1 loops, best of 3: 13.5 s per loop
kd_test: 1 loops, best of 3: 30.9 s per loop

Is this because I am using it wrong - somehow?

Q2.

I would assume that the even to get the ranking (rather than distance) of closest points one still needs to project the data. However, it seems that the projected and un-projected points give me the same nearest neighbour:

def proj_list(points,
              inproj = Proj(init='epsg:4326'),
              outproj = Proj(init='epsg:27700')):
    """ Projected geo coordinates"""
    return [list(transform(inproj,outproj,x,y)) for y,x in points]
proj_points = proj_list(points)
proj_point = proj_list([point])[0]

Is this just because my spread of points is not big enough to introduce distortion? I re-ran a few times and still got the same index out of the projected and un-projected lists being returned.

Q3.

Is it generally faster to project the points (like above) and calculate the hypotenuse distance compared to calculating the haversine or vincenty distance on (un-projected) latitude/longitudes? Also which option would be more accurate? I ran a small test:

from math import *
def haversine(origin,
              destination):
    """
    Find distance between a pair of lat/lng coordinates
    """
    lat1, lon1, lat2, lon2 = map(radians, [origin[0],origin[1],destination[0],destination[1]])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * asin(sqrt(a))
    r = 6371000  # Metres
    return (c * r)

def closest_math_unproj(points,point):
    """ Haversine on unprojected """
    return (min((haversine(point,pt),pt[0],pt[1]) for pt in points))

def closest_math_proj(points,point):
    """ Simple angle since projected"""
    return (min((hypot(x2-point[1],y2-point[0]),y2,x2) for y2,x2 in points)) 

Results:

enter image description here

So this seems to say that projecting and then doing distance is faster than not - however, I am not sure which method will bring more accurate results.

Testing this on an online vincenty calculation is seems the projected co-ordinates are the way to go:

enter image description here


Solution

  • Q1.

    The reason for the apparent inefficiency of the k-d tree is quite simple: you are measuring both the construction and querying of the k-d tree at once. This is not how you would or should use a k-d tree: you should construct it only once. If you measure only the querying, the time taken reduces to mere tens of milliseconds (vs seconds using the brute-force approach).

    Q2.

    This will depend on the spatial distribution of the actual data being used and the projection being used. There might be slight differences based on how efficient the implementation of the k-d tree is at balancing the constructed tree. If you are querying only a single point, then the result will be deterministic and unaffected by the distribution of points anyway.

    With the sample data that you are using, which has strong central symmetry, and with your map projection (Transverese Mercator), the difference should be negligible.

    Q3.

    Technically, the answer to your question is trivial: using the Haversine formula for geographic distance measurement is both more accurate and slower. Whether the tradeoff between accuracy and speed is warranted depends heavily on your use case and the spatial distribution of your data (mostly on the spatial extent, obviously).

    If the spatial extent of your points is on the small, regional side, then using a suitable projection and the simple Euclidean distance measure might be accurate enough for your use case and faster than using the Haversine formula.