How do you find nearest pairs between sets without using O(N*M) memory?

I have two sets of vectors A (red) and B (blue). I'm looking to find the nearest vector in A to each point in B.

I got this working using:

import numpy as np
import random

def get_assignment(A, B):
    """
    returns `assignment`, where `assignment[i]` means A[assignment[i]] is the nearest point to B[i]
    """
    cost = np.linalg.norm(A[:, np.newaxis, :] - B, axis=2)
    assignment = np.argmin(cost, axis=0)
    return assignment

if __name__ == '__main__':
    An = 3
    Bn = 10
    A = np.array([ [ random.random(), random.random() ] for i in range(An) ]) 
    B = np.array([ [ random.random(), random.random() ] for i in range(Bn) ]) 
    assignment = get_assignment(A, B)

    import pylab

    for i in range(len(A)):
        Bi = np.where(assignment == i)[0]
        for j in Bi:
            x1, y1 = A[i]
            x2, y2 = B[j]
            pylab.plot(x1, y1, 'ro')
            pylab.plot(x2, y2, 'bo')
            pylab.plot([x1, x2], [y1, y2], 'k-', lw=0.5)

    pylab.axis('equal')
    pylab.show()

But I'm looking to do this where A has around 1,000 elements and B 1,000,000. Is there a way to do this using less memory?

Solution

I would suggest applying a K-D tree.

Example:

import numpy as np
import random
import sklearn.neighbors


def get_assignment(A, B):
    """
    returns `assignment`, where `assignment[i]` means A[assignment[i]] is the nearest point to B[i]
    """
    tree = sklearn.neighbors.KDTree(A)
    assignment = tree.query(B, return_distance=False).reshape(-1)
    return assignment


if __name__ == '__main__':
    An = 1000
    Bn = 1000000

    A = np.random.rand(An, 2)
    B = np.random.rand(Bn, 2)
    assignment = get_assignment(A, B)

Measuring this using /usr/bin/time -v, it takes roughly 36 MB to do this query. It's also much faster.