Search code examples
pythonnumpyknn

Slow Euclidean Distance


I am calculating Euclidean Distance with python code below:

def getNeighbors(trainingSet, testInstance, k, labels):
    distances = []
    for x in range(len(trainingSet)):
        dist = math.sqrt(((testInstance[0] - trainingSet[x][0]) ** 2) +    ((testInstance[1] - trainingSet[x][1]) ** 2))
        distances.append([dist, labels[x]])
    distances = np.array(distances)   
    return distances

For calculating distance of a given point with 10 other points, it's good. But when I calculate distance of a point with 18563 other points, then the computer gets hanged and don't response for around 3 Hrs.

How can I make calculation for 18563 points faster?


Solution

  • You can speed this up by converting to NumPy first, then using vector operations, instead of doing the work in a loop, then converting to NumPy. Something like this:

    trainingArray = np.array(trainingSet)
    distances = ((testInstance[0] - trainingArray[:, 0]) ** 2 +
                 (testInstance[1] - trainingArray[:, 1]) ** 2).sqrt()
    

    (That's obviously untested, because without enough context to know what's actually in those variables I had to guess, but it will be something close to that.)

    There are other things you can do to squeeze out a few extra %—try replacing the ** 2 with self-multiplication or the sqrt with ** .5, or (probably best) replacing the whole thing with np.hypot. (If you don't know how to use timeit—or, even better, IPython and it's %timeit magic—now is a great time to learn.)

    But ultimately, this is just going to give you a constant-multiplier speedup of about an order of magnitude. Maybe it takes 15 minutes instead of 3 hours. That's nice, but… why is it taking 3 hours in the first place? What you're doing here should be taking on the order of seconds, or even less. There's clearly something much bigger wrong here, like maybe you're calling this function N**2 times when you think you're only calling it once. And you really need to fix that part.

    Of course it's still worth doing this. First, element-wise operations are simpler and more readable than loops, and harder to get wrong. Second, even if you reduce the whole program to 3.8 seconds, you'll be happy for the order-of-magnitude speedup to 0.38 seconds, right?