speeding up processing 5 million rows of coordinate data

I have a csv file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data. I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if there is another point is within 5 miles, and False if there isn't.

Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):

from geopy.point import Point
from geopy.distance import vincenty
import csv


class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False


try:
    output = open('output.csv','w')
    writer = csv.writer(output, delimiter = ',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # 5 miles
    close_limit = 5
    geo_points = []

    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None) # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(row[0], row[1]))

    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit: # (0,close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])

finally:
    output.close()

As you might be able to tell from looking at it, this solution is extremely slow. So slow in fact that I let it run for 3 days and it still didn't finish!

I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section checked against the borders of its adjacent sections, and that just seems overly complex and I'm afraid it would be more of a headache than it's worth.

So any pointers on how to make this faster?

Solution

Let's look at what you're doing.

You read all the points into a list named geo_points.

Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.
You loop over all the geo_points. That's 5 million, according to you.
Within the outer loop, you loop again over all 5 million geo_points.
You compute the distance in miles between the two loop items.
If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.
When the inner loop stops, you write information about the outer loop item to a CSV file.

Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.

This is what O(n²) means.

The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?

Anyway, you have some problems.

Problem 1: You'll eventually wind up comparing every point against itself. Which should have a distance of zero, meaning they will all be marked as within whatever distance threshold. If your program ever finishes, all the cells will be marked True.

Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance from each other, you are recording that information about point #1. But you don't record the same information about the other point. You know that point #12345 (geo_point2) is reflexively within the threshold of point #1, but you don't write that down. So you're missing a chance to just skip over 5 million comparisons.

Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).

Answers?

Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.

As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!

Sorted List

This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.

How would that work? Same way as before, except your inner loop would have some checks like this:

five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call this 5 million times

for i, pi in enumerate(geo_points):

    if pi.close_to_another_point:
        continue   # Remember if close to an earlier point

    pi0max = pi[0] + five_miles
    pi1min = pi[1] - five_miles
    pi1max = pi[1] + five_miles

    for j in range(i+1, list_len):
        pj = geo_points[j]
        # Assumes geo_points is sorted on [0] then [1]
        if pj[0] > pi0max:
            # Can't possibly be close enough, nor any later points
            break
        if pj[1] < pi1min or pj[1] > pi1max:
            # Can't be close enough, but a later point might be
            continue

        # Now do "real" comparison using accurate functions.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break

What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point. (What you called geo_point). Then, I'm quickly checking to see if we already know that this point is close to another one.

If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.

Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.

Unsorted List

But what if the list is not sorted?

@RootTwo points you at a kD tree (where D is for "dimensional" and k in this case is "2"). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:

def insert_node(node, treenode, depth=0):
    dimension = depth % 2  # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]

    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)

What would this do? This would get you a searchable tree where points could be inserted in O(log n) time. That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)

It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).

Advice

My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.

Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.