algorithm language-agnostic nearest-neighbor

How should I find nearest neighbors for every element in a list?

I have two sets of integers A and B (size of A less than or equal to B), and I want to answer the question, "How close is A to B?". The way I want to answer this question is by producing a measure of how far you have to go from a given a in A to find a b in B.

The specific measure I want to produce does the following: for each a, find the closest b, the only catch being that once I match a b with an a, I can no longer use that b to match any other a's. (EDIT: the algorithm I'm trying to implement will always prefer a shorter match. So if b is the nearest neighbor to more than one a, pick the a nearest to b. I'm not sure what to do if more than one a has the same distance to b, right now I'm picking the a that precedes b, but that's quite arbitrary and not necessarily optimal.) The measure that I'll for make these sets, the final product, is a histogram showing the number of pairs in the vertical axis and the distance of the pairs in the x-axis.

So if A = {1, 3, 4} and B = {1, 5, 6, 7}, I will get the following a,b pairs: 1,1, 4,5, 3,6. For these data, the histogram should show one pair with distance zero, one pair with distance 1, and one pair with distance 3.

(The actual size of these sets has an upper bound around 100,000 elements, and I read them in from disk already sorted low to high. The integers range from 1 to ~20,000,000. EDIT: also, the elements of A and B are unique, i.e. no repeated elements.)

The solution I've come up with feels a bit clunky. I'm using Perl, but the problem is more or less language agnostic.

First I make a hash, with one key for each number that appears in the union of A and B and values indicating whether each number appears in A, B, or both, e.g. $hash{5} = {a=>1, b=>1} if the number 5 appears in both data-sets. (If it only appeared in A, you'd have $hash{5} = {a=>1}.)
Next, I iterate over A to find all the hash elements that appear in A and B, mark them in the measure, and remove them from the hash.
Then, I sort all the hash keys and make each element of the hash point to its nearest neighbors, like a linked list, where a given hash element now looks like $hash{6} = {b=>1, previous=>4, next=>8}. The linked list doesn't know whether the next and previous elements are in A or B.
Then I loop over pair distances starting at d=1, and find all pairs with distance d, mark them, remove them from the hash, until there are no more elements of A to match.

The loop looks like this:

for ($d=1; @a > 0; $d++) {
    @left = ();
    foreach $a in @a {
        $next = $a;
        # find closest b ahead of $a, stop searching if you pass $d
        while (exists $hash{$next}{next} && $next - $a < $d) {
            $next = $hash{$next}{next};
        }
        if ($next is in B && $next - $a == $d) {
            # found a pair at distance $d
            mark_in_measure($a, $next);
            remove_from_linked_list($next);
            remove_from_linked_list($a);
            next;
        }

        # do same thing looking behind $a
        $prev = $a;
        ...

        # you didn't find a match for $a
        push @left, $a;
    }
    @a = @left;
}

This loop obviously prefers pairs that match b's that appear later than a's; I can't tell whether there's a sensible way to decide whether later is better than prior (better in terms of getting closer pairs). The main optimization I'm interested in is processing time.

Solution

Sounds like you have a particular case of the Assignment Problem (finding a minimum matching in a weighted bipartite graph).

The algorithm to solve the assignment problem is too slow for you at O(N^3) but I'm pretty sure there you can shave some of this complexity off by exploiting your particular weight function or how you only want a histogram instead of the exact matching.