Tags: algorithm, divide-and-conquer, clrs

Can I compute closest split pair of points where distance is strictly less than delta


The Closest Pair of Points Problem has been intriguing me lately. More specifically, the divide and conquer algorithm.

This recursive algorithm requires me to split a set of points into two chunks, a and b, compute the closest pair of points for each chunk, then compute the closest pair of points between the chunks, and return the minimum of these 3 quantities.

The algorithm that computes the closest pair between the chunks works by only iterating over at most 7 points that are at most min(a, b) away from the last point in a (the median element of the whole set).

These two pictures represent the problem a little better. min(a, b) = Delta.


I understand that if I were to have 2 points on the line l, belonging to both sides a and b, I'd need to compare at most 7 points in the middle strip.

What I am wondering, though, is: if I construct the middle strip from points strictly less than Delta away from l, couldn't I just compare the next 4 points instead of 7, since I can only fit 2 points on each side of l that are less than Delta away from l and at least Delta away from each other?

Edit: I have started a bounty for a very similar question on CS Stack Exchange, where we have had a very interesting discussion, so I'm linking it here. We've also started a very insightful chat here.


Solution

  • Note: this answer was updated for future reference based on extensive discussions with @Danilo Souza Morães and @burnabyRails, who is the author of the accepted answer on a similar question by Danilo on the CS site. The main problem with the original answer is that I assumed a slightly different algorithm was being used/discussed here. Since the original answer provided some important insights to Danilo, it is preserved as is at the end. Also, if you are going to read the discussion mentioned by Danilo in his question, be sure to read the Intro here first to better understand what I meant. The added preface discusses different versions of the algorithm in question.

    Intro

    This algorithm is based on two main ideas:

    1. Typical recursive divide and conquer approach
    2. The fact that if you already know the best distance within both the left and right areas, you can do the "combine" step in O(N) time, because you can prove that the check for each point takes O(1) time.
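These two ideas combine into the familiar recursive structure. Here is a minimal Python sketch (the helper names are mine, not from any particular textbook; it assumes distinct points given as (x, y) tuples and uses the simple 7-successor strip check for the combine step):

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def closest_pair(points):
    """Divide-and-conquer closest pair; points is a list of distinct (x, y) tuples."""
    px = sorted(points)                      # sorted by x
    py = sorted(points, key=lambda p: p[1])  # sorted by y
    return _closest(px, py)

def _closest(px, py):
    n = len(px)
    if n <= 3:  # base case: brute force over all pairs
        return min(dist(p, q) for i, p in enumerate(px) for q in px[i + 1:])
    mid = n // 2
    mid_x = px[mid][0]
    lx, rx = px[:mid], px[mid:]
    left_set = set(lx)                       # distinctness assumed here
    ly = [p for p in py if p in left_set]
    ry = [p for p in py if p not in left_set]
    delta = min(_closest(lx, ly), _closest(rx, ry))
    # combine step: only points within delta of the dividing line matter
    strip = [p for p in py if abs(p[0] - mid_x) < delta]
    for i, p in enumerate(strip):
        for q in strip[i + 1:i + 8]:         # at most 7 successors need checking
            delta = min(delta, dist(p, q))
    return delta
```

This is only a sketch of the common structure; the variations discussed below all concern the inner loop of the combine step.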

    Nevertheless, there are several ways to implement the same basic ideas in practice, particularly the combine step. The main differences are:

    1. Do you treat the "left" and "right" points independently, and possibly in different ways?
    2. For each point, do you do a fixed number of checks, or do you filter the candidates first and only do distance checks for the filtered candidates? Note that you can do some filtering without breaking the O(N*log(N)) time if you can ensure it is done in amortized O(1). In other words, if you know there is a strong upper bound on the number of good candidates, you don't have to check exactly that number of candidates.

    The classical description of the algorithm in the CLRS Introduction to Algorithms clearly answers question #1: you mix the "left" and "right" points and treat them jointly. As for #2, the answer is less clear (to me), but it probably implies checking some fixed number of points. This seems to be the version that Danilo had in mind when asking the question.

    The algorithm I had in mind differs on both points: it goes up through the "left" points and checks each of them against all the filtered "right" candidates. Obviously I was not aware of this difference while writing my answer and during the initial discussion in the chat.

    "My" algorithm

    Here is a sketch of the algorithm I had in mind.

    1. The starting steps are common: we have already processed the "left" and "right" points and know the best 𝛿 so far, and we have them sorted along the Y axis. We have also filtered out all the points that do not lie in the ±𝛿 strip.

    2. The outer loop goes up through the "left" points. Assume we are processing one of them; let's call it "the left point". Additionally, and semi-independently, we iterate over the "right" points, moving the "start position" when necessary (see step #3).

    3. For each left point, move up from the last start position in the right points and increment the start position until we get within 𝛿 (or rather -𝛿) of the left point along the Y axis. (Note: the separation of the "left" and the "right" points is what makes it necessary to start at -𝛿.)

    4. Now, from that start position, continue going up and compute the distance to every point that lies within +𝛿 of the current left point along the Y axis.

    The filtering done in steps #3 and #4 is what makes this "data-dependent". The trade-off in this implementation is that you do fewer distance checks at the cost of more Y-checks. Also, the code is arguably more complicated.

    Why does this combining algorithm work in O(N)? For the same reason that there is a fixed bound (i.e. O(1)) on how many points you can fit into a 𝛿 × 2𝛿 rectangle such that the distance between any two of them is at least 𝛿. It implies that O(1) distance checks are made at step #4. In practice, all the distances checked by this algorithm will also be checked by the CLRS version, and depending on the data some more might be checked as well.

    The amortized cost of step #3 is also O(1), because its total cost over the whole outer loop is O(N): you clearly can't move the start position up more times than there are "right" points in total.
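Steps #1–#4 above can be sketched as follows (a sketch under the stated assumptions: both strips are already filtered to the ±𝛿 strip and sorted by Y; the names and the `dist` parameter are mine):

```python
def combine(left_strip, right_strip, delta, dist):
    """Combine step sketch: left_strip and right_strip are the points in the
    middle strip, each sorted by Y; delta is the best distance so far."""
    best = delta
    start = 0  # start position into right_strip; it only ever moves up
    for p in left_strip:                      # step 2: outer loop over "left" points
        # step 3: advance start until the right points are within -delta of p in Y
        while start < len(right_strip) and right_strip[start][1] < p[1] - delta:
            start += 1
        # step 4: do distance checks until the right points pass +delta of p in Y
        j = start
        while j < len(right_strip) and right_strip[j][1] <= p[1] + delta:
            best = min(best, dist(p, right_strip[j]))
            j += 1
    return best
```

Note how the inner `while` loops express the Y-checks: the number of distance checks per left point is data-dependent (possibly 0), but bounded by a constant.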

    Modified CLRS algorithm

    You can easily make a different decision on #2 even in the CLRS-version of the algorithm. The description of the crucial step says:

    1. For each point p in the array Y′, the algorithm tries to find points in Y′ that are within δ units of p. As we shall see shortly, only the 7 points in Y′ that follow p need be considered. The algorithm computes the distance from p to each of these 7 points and keeps track of the closest-pair distance δ′ found over all pairs of points in Y′

    It is easy to modify it so that instead of unconditionally checking 7 points, you first check whether the difference in Y is less than 𝛿. Obviously, you are still guaranteed to check at most 7 points. Again, the trade-off is that you do fewer distance checks but some additional Y-difference checks. Depending on your data and on the relative performance of different operations on your hardware, it might be a good or a bad trade-off.
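A sketch of that modification (hypothetical names, not the CLRS pseudocode itself; `ys` is the merged strip sorted by Y, and the inner loop breaks on the Y-difference check instead of counting to 7):

```python
def strip_closest(ys, delta, dist):
    """Return the best distance found in the mixed strip, starting from delta.
    ys: strip points (left and right mixed) sorted by Y."""
    best = delta
    for i, p in enumerate(ys):
        for q in ys[i + 1:]:
            if q[1] - p[1] >= delta:  # Y-difference check first; at most 7
                break                 # successors can pass it
            best = min(best, dist(p, q))
    return best
```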

    Some other important ideas

    1. If you don't need to find all the pairs with the minimal distance, you can safely use `<𝛿` instead of `<=𝛿` while filtering.

    2. In the world of real hardware with floating-point representation of numbers, the idea of `=` is not that clear. Usually you need to check something like `abs(a - b) < ε` anyway.

    3. The idea behind my counter-example (to a different algorithm): you don't have to put all the points on the edges. An equilateral triangle with side 𝛿 can fit into a square with side 𝛿-ε (which is another way of saying <𝛿).


    Original Answer

    I don't think you correctly understand the meaning of the constant 7 in this algorithm. In practice you don't check 4 or 7 or any fixed number of points: you go up (or down) along the Y axis and check the points that fall into the corresponding rectangle. There might easily be 0 such points. What is really important for the algorithm to run in the promised O(n*log(n)) time is that there is a fixed upper bound on the number of points checked at that step. Any constant upper bound would work, as long as you can prove it. In other words, what matters is that this step is O(n), not the particular constant: 7 is just one that is relatively easy to prove, and that's all.

    I believe the practical reason for 7 is that on real hardware you can't compute distances exactly in floating-point data types; you're bound to have some rounding errors. That's why using < instead of <= is not really practical.

    And finally, I don't think 4 is a correct bound, assuming you can reliably do <. Let's assume 𝛿 = 1 and let the "left" point be (-0.0001; 0), so the "right" < rectangle for it is 0 <= x < 1 and -1 < y < 1. Consider the following 5 points (the idea: 4 points almost in the corners, placed just close enough to fit the rectangle < 𝛿 while keeping the distances between them > 𝛿, and a 5th at the center of the rectangle):

    • P1 = (0.001; 0.999)
    • P2 = (0.999; 0.93)
    • P3 = (0.001; -0.999)
    • P4 = (0.999; -0.93)
    • P5 = (0.5; 0)

    Note that while those 5 points are pairwise more than 𝛿 apart, some of them might be less than 𝛿 from the "left" point. This is why we check them in the first place.

    Distance P1-P2 (and symmetrically P3-P4) is

    sqrt(0.998^2 + 0.069^2) = sqrt(0.996004 + 0.004761) = sqrt(1.000765) > 1
    

    Distance P1-P5 (and symmetrically P3-P5) is

    sqrt(0.499^2 + 0.999^2) = sqrt(0.249001 + 0.998001) = sqrt(1.247002) > 1
    

    Distance P2-P5 (and symmetrically P4-P5) is

    sqrt(0.499^2 + 0.93^2) = sqrt(0.249001 + 0.8649) = sqrt(1.113901) > 1
    

    So you can fit 5 points into such a < rectangle, which is clearly more than 4.
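To double-check the arithmetic, here is a quick verification of the counter-example (using `math.dist`, available since Python 3.8):

```python
import math
from itertools import combinations

pts = {
    "P1": (0.001, 0.999),
    "P2": (0.999, 0.93),
    "P3": (0.001, -0.999),
    "P4": (0.999, -0.93),
    "P5": (0.5, 0.0),
}
left = (-0.0001, 0.0)

# all 5 points are pairwise more than delta = 1 apart...
assert all(math.dist(pts[a], pts[b]) > 1 for a, b in combinations(pts, 2))
# ...all lie strictly inside the "right" rectangle 0 <= x < 1, -1 < y < 1...
assert all(0 <= x < 1 and -1 < y < 1 for x, y in pts.values())
# ...and at least one of them (the center point) is within delta of the "left" point
assert math.dist(left, pts["P5"]) < 1
```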