Search code examples
pythonnumpyvectorization

How do you find nearest pairs between sets without using O(N*M) memory?


I have two sets of vectors A (red) and B (blue). I'm looking to find the nearest vector in A to each point in B.

enter image description here

I got this working using:

import numpy as np
import random

def get_assignment(A, B):
    """
    returns `assignment`, where `assignment[i]` means A[assignment[i]] is the nearest point to B[i]
    """
    cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)
    assignment = np.argmin(cost, axis=0)
    return assignment

if __name__ == '__main__':
    An = 3
    Bn = 10
    A = np.array([ [ random.random(), random.random() ] for i in range(An) ]) 
    B = np.array([ [ random.random(), random.random() ] for i in range(Bn) ]) 
    assignment = get_assignment(A, B)

    import pylab

    for i in range(len(A)):
        Bi = np.where(assignment == i)[0]
        for j in Bi:
            x1, y1 = A[i]
            x2, y2 = B[j]
            pylab.plot(x1, y1, 'ro')
            pylab.plot(x2, y2, 'bo')
            pylab.plot([x1, x2], [y1, y2], 'k-', lw=0.5)

    pylab.axis('equal')
    pylab.show()

But I'm looking to do this where A has around 1,000 elements and B 1,000,000. Is there a way to do this using less memory?


Solution

  • I would suggest applying a K-D tree.

    Example:

    import numpy as np
    import random
    import sklearn.neighbors
    
    
    def get_assignment(A, B):
        """
        returns `assignment`, where `assignment[i]` means A[assignment[i]] is the nearest point to B[i]
        """
        tree = sklearn.neighbors.KDTree(A)
        assignment = tree.query(B, return_distance=False).reshape(-1)
        return assignment
    
    
    if __name__ == '__main__':
        An = 1000
        Bn = 1000000
    
        A = np.random.rand(An, 2)
        B = np.random.rand(Bn, 2)
        assignment = get_assignment(A, B)
    

    Measuring this using /usr/bin/time -v, it takes roughly 36 MB to do this query. It's also much faster.