Search code examples
pythonalgorithm2d

Given two lists of 2d points, how to find the closest point in the 2nd list for every point in the 1st list?


I have two large numpy arrays of randomly sorted 2d points, let's say they're A and B. What I need to do is find the number of "matches" between the two arrays, where a match is a point in A (call it A') being within some given radius R with a point in B (call it B'). This means that every point in A must match with either 1 or no points in B. It would also be nice to return the list indices of the matches between the two arrays, however this isn't necessary. Since there can be many points in this radius R, it seems better to find the point which is closest to A' in B, and then checking if it's within the radius R. This is tested simply with the distance formula dx^2 + dy^2. Obviously there's the brute force O(n^2) solution of looping through both arrays, but I need something faster, hopefully O(n log n).

What I've seen is that a Voronoi diagram can be used for a problem like this, however I'm not sure how this would be implemented. I'm unfamiliar with Voronoi diagrams, so I'm generating it with scipy.spatial.Voronoi. Is there a fast algorithm for this problem by using these diagrams or is there another?


Solution

  • I think there are several options. I ginned up a small comparison test to explore a few. The first couple of these only go as far as finding how many points are mutually within radius of each other to make sure I was getting consistent results on the main part of the problem. It does not answer the mail on the part of your problem about finding the closest, which I think would be just a bit more work on a few of them--did it for the last option, see bottom of post. The driver of the problem is doing all of the comparisons, and I think you can make some hay by some sorting (last notion here) to limit comparisons.

    Naive Python

    Use brute force point-to-point comparison. Clearly O(n^2).

    Scipy's cdist module

    Works great & fastest for "small" data. With large data, this starts to blow up because of size of matrix output in memory. Probably infeasible for 1M x 1M application.

    Scipy's KDTree module

    From other solution. Fast, but not as fast as cdist or "sectioning" (below). Perhaps there is a different way to employ KDTree for this task... I'm not very experienced with it. This approach (below) seemed logical.

    Sectioning the compare-to array

    This works very well because you are not interested in all of the distances, you just want ones that are within a radius. So, by sorting the target array and only looking within a rectangular window around it for "contenders" you can get very fast performance w/ native python and no "memory explosion." Probably still a bit "left on the table" here for enhancement maybe by embedding cdist within this implementation or (gulp) trying to multithread it.

    Other ideas...

    This is a tight "mathy" loop so trying something in cython or splitting up one of the arrays and multi-threading it would be novel. And pickling the result so you don't have to run this often seems prudent.

    I think any of these you could augment the tuples with the index within the array pretty easily to get a list of the matches.

    My older iMac does 100K x 100K in 90 seconds via sectioning, so that does not bode well for 1M x 1M

    Comparison:

    # distance checker
    
    from random import uniform
    import time
    import numpy as np
    from scipy.spatial import distance, KDTree
    from bisect import bisect
    from operator import itemgetter
    import sys
    from matplotlib import pyplot as plt
    sizes = [100, 500, 1000, 2000, 5000, 10000, 20000]
    #sizes = [20_000, 30_000, 40_000, 50_000, 60_000]   # for the playoffs.  :)
    naive_times = []
    cdist_times = []
    kdtree_times = []
    sectioned_times = []
    delta = 0.1
    
    for size in sizes:
        print(f'\n *** running test with vectors of size {size} ***')
        r = 20  # radius to match
        r_squared = r**2
    
        A = [(uniform(-1000,1000), uniform(-1000,1000)) for t in range(size)]
        B = [(uniform(-1000,1000), uniform(-1000,1000)) for t in range(size)]
    
        # naive python
        print('naive python')
        tic = time.time()
        matches = [(p1, p2) for p1 in A
                            for p2 in B
                            if (p1[0] - p2[0])**2 + (p1[1] - p2[1])**2 <= r_squared]
    
        toc = time.time()
        print(f'found: {len(matches)}')
        naive_times.append(toc-tic)
        print(toc-tic)
        print()
    
        # using cdist module
        print('cdist')
        tic = time.time()
        dist_matrix = distance.cdist(A, B, 'euclidean')
        result = np.count_nonzero(dist_matrix<=r)
        toc = time.time()
        print(f'found: {result}')
        cdist_times.append(toc-tic)
        print(toc-tic)
        print()
    
        # KDTree
        print('KDTree')
        tic = time.time()
        my_tree = KDTree(A)
        results = my_tree.query_ball_point(B, r=r)
        # for count, r in enumerate(results):
        #   for t in r:
        #       print(count, A[t])
    
        result = sum(len(lis) for lis in results)
        toc = time.time()
        print(f'found: {result}')
        kdtree_times.append(toc-tic)
        print(toc-tic)
        print()
    
        # python with sort and sectioning
        print('with sort and sectioning')
        result = 0
        tic = time.time()
        B.sort()
        for point in A:
            # gather the neighborhood in x-dimension within x-r <= x <= x+r+1
            # if this has any merit, we could "do it again" for y-coord....
            contenders = B[bisect(B,(point[0]-r-delta, 0)) : bisect(B,(point[0]+r+delta, 0))]
            # further chop down to the y-neighborhood
            # flip the coordinate to support bisection by y-value
            contenders = list(map(lambda p: (p[1], p[0]), contenders))
            contenders.sort()
            contenders = contenders[bisect(contenders,(point[1]-r-delta, 0)) : 
                                    bisect(contenders,(point[1]+r+delta, 0))]
            # note (x, y) in contenders is still inverted, so need to index properly
            matches = [(point, p2) for p2 in contenders if (point[0] - p2[1])**2 + (point[1] - p2[0])**2 <= r_squared]
            result += len(matches)
        toc = time.time()
        print(f'found: {result}')
        sectioned_times.append(toc-tic)
        print(toc-tic)
    print('complete.')
    
    plt.plot(sizes, naive_times, label = 'naive')
    plt.plot(sizes, cdist_times, label = 'cdist')
    plt.plot(sizes, kdtree_times, label = 'kdtree')
    plt.plot(sizes, sectioned_times, label = 'sectioning')
    plt.legend()
    plt.show()
    

    Results for one of the sizes and plots:

     *** running test with vectors of size 20000 ***
    naive python
    found: 124425
    101.40657806396484
    
    cdist
    found: 124425
    2.9293079376220703
    
    KDTree
    found: 124425
    18.166933059692383
    
    with sort and sectioning
    found: 124425
    2.3414530754089355
    complete.
    

    Note: In first plot, cdist overlays the sectioning. Playoffs are shown in second plot. smaller sizes

    The "playoffs"

    The playoffs

    Modified sectioning code

    This code finds the minimum within the points within radius. Runtime is equivalent to the sectioning code above.

    print('with sort and sectioning, and min finding')
    result = 0
    pairings = {}  
    tic = time.time()
    B.sort()
    def dist_squared(a, b): 
        # note (x, y) in point b will be inverted (below), so need to index properly
        return (a[0] - b[1])**2 + (a[1] - b[0])**2
    for idx, point in enumerate(A):
        # gather the neighborhood in x-dimension within x-r <= x <= x+r+1
        # if this has any merit, we could "do it again" for y-coord....
        contenders = B[bisect(B,(point[0]-r-delta, 0)) : bisect(B,(point[0]+r+delta, 0))]
        # further chop down to the y-neighborhood
        # flip the coordinate to support bisection by y-value
        contenders = list(map(lambda p: (p[1], p[0]), contenders))
        contenders.sort()
        contenders = contenders[bisect(contenders,(point[1]-r-delta, 0)) : 
                                bisect(contenders,(point[1]+r+delta, 0))]
        matches = [(dist_squared(point, p2), point, p2) for p2 in contenders 
            if dist_squared(point, p2) <= r_squared]
        if matches:
            pairings[idx] = min(matches)[1]  # pair the closest point in B with the point in A
    toc = time.time()
    print(toc-tic)