Faulty conditional check in nearest neighbor calculation?

I'm trying to write a function to calculate an approximate traveling salesman route through a list of cities using the nearest neighbor algorithm, starting from the first city listed. However, each time I run it I get IndexError: list index out of range.

In debugging the error, I found that the value of index stays the same from one loop iteration to the next, rather than changing. When it's time to append, the code checks the if not in condition; since it's False, it adds 1 to i and moves to the next iteration of the loop. Once it reaches a higher number than exists in the array, it gives me the error.

So my problem is, why does execution not enter the first if not in block? It seems like the code ignores it.

For my actual problem I'm reading in a file of 317 cities, each of which has an index and two coordinates. Here's a shorter sample list of cities for testing:

Nodelist = [
    (1, 63, 71),
    (2, 94, 71),
    (3, 142, 370),
    (4, 173, 1276),
    (5, 205, 1213),
    (6, 213, 69),
    (7, 244, 69),
    (8, 276, 630),
    (9, 283, 732),
    (10, 362, 69),
]

Here is the function's code:

def Nearest(Nodelist,Distance,index):
    Time_Calculation = time.time()
    Newarray=[]
    Newarray.append(Nodelist[0])
    for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i][1],2)+pow(Nodelist[j][2]-Newarray[i][2],2))
                if Distance==0:
                    Distance=DisEquation
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])
        Distance=0
    print (time.time() - Time_Calculation)
    return Newarray

Code that calls it:

NearestMethodArr=Nearest(Cities,b,index)
print(NearestMethodArr)
print(len(NearestMethodArr))

The print statements should produce:

[(1, 63, 71), (2, 94, 71), (6, 213, 69), (7, 244, 69), (10, 362, 69), (3, 142, 370), (8, 276, 630), (9, 283, 732), (5, 205, 1213), (4, 173, 1276)]
10

Solution

Newarray.append(Nodelist[0])#adding Nodelist[0] to NewArray    
for i in range(0,len(Nodelist)):
        for j in range(1,len(Nodelist)):
            if (Nodelist[j] not in Newarray):
                DisEquation = math.sqrt(pow(Nodelist[j][1]-Newarray[i [1],2)+pow(Nodelist[j][2]-Newarray[i][2],2)) #you access Newarray at i
                if Distance==0:
                    Distance=DisEquation
                if Distance > DisEquation:
                    index=j
                    Distance=DisEquation
        if(Nodelist[index] not in Newarray):
            Newarray.append(Nodelist[index])#you conditionally add new elements to newarray

If you see the comments I added to your code the issues should be clear. You loop over all the elements of Nodelist and call the index i you've already added one element to NewArray so the first time index 0 exists. Then you hit the Nodelist[index] not in Newarray, if it is true NewArray gets bigger by 1 and then NewArray[1] works if this is ever not true for any reason then NewArray stays the same size and the next NewArray[i] will be an index out of range error.

Edit: thanks to CrazyChucky for setting me straight in the comments. I've adjusted below

My comment about the failure was correct though I hadn't identified the root cause which was not setting index as the author identified. I didn't parse the code correctly in my head. The code in the updated version works though it would be faster and easier to read if you did something like:

def new_nearest(Nodelist):
    start_time = time.time()
    shortest_path = [Nodelist[0]]
    Nodelist = Nodelist[1:]
    while len(Nodelist) > 0:
        shortest_dist_sqr = -1
        next_node = None
        for potential_dest in Nodelist:
            dist_sqr = (shortest_path[-1][1] - potential_dest[1])**2 + (shortest_path[-1][2] - potential_dest[2])**2 #you don't keep the distances so there is no need to sqrt as if a > b then a**2 > b**2
            if shortest_dist_sqr < 0 or dist_sqr < shortest_dist_sqr:
                next_node = potential_dest
                shortest_dist_sqr = dist_sqr
        shortest_path.append(next_node)
        Nodelist.remove(next_node)
    print(time.time() - start_time)
    return shortest_path

This returns identical results but executes faster. changing to an approach where you remove nodes from the inner loop makes it clearer what's going on, it might make the code a bit slower (it would in C but python has a lot of overhead in various places that may make this a net gain,) and because there is no need to calculate the actual distance since you don't store it you can compare the distance squared and not do any square roots. If you do want the distance you can just square root the nearest node after you've determined which it is.

Edit: I couldn't help but check. The move to removing nodes from Nodelist actually represents the majority of the time savings while the lack of a sqrt does reliably speed things up (I used timeit and varied the code.) In a lower level language doing tiny things is ultra fast so it's likely faster to leave the array alone and just skip over elements that are already used (this might not actually be true because it will mess with branch prediction performance analysis is really hard and will depend on what processor architecture you're using...) In python though even small things are very expensive (add two variables: figure out what types they are, decode variable byte length ints, do add, create new object for result...) so even though removing a value from a list may be more expensive than skipping values and leaving the list alone it will lead to more small operations that are very slow in Python. If a low level language was in use you could also recognise that the order of the nodes is arbitrary (except for which one is first,) so you can just have an array with all the nodes and instead of making a new small array you track the length of the values used in the array and copy the last value in the array over the value that is selected for the next node in the path.

Edit yet again :P : Again I couldn't help but be curious. The approach to overwriting part of the nodelist instead of removing entries had me wondering as to if it would be faster in python as it does create more work that is slow in python but decreases the amount of work involved in removing node elements. Turns out that this approach is a noticeable (though not very significant a little less than 2% but consistent in micro benchmarks) speedup even in python so the below code is even faster:

def newer_nearest(Nodelist):
    shortest_path = [Nodelist[0]]
    Nodelist = Nodelist[1:]
    while len(Nodelist) > 0:
        shortest_dist_sqr = -1
        next_node = None
        for index, potential_dest in enumerate(Nodelist):
            dist_sqr = (shortest_path[-1][1] - potential_dest[1])**2 + (shortest_path[-1][2] - potential_dest[2])**2 #you don't keep the distances so there is no need to sqrt as if a > b then a**2 > b**2
            if shortest_dist_sqr < 0 or dist_sqr < shortest_dist_sqr:
                next_node = index
                shortest_dist_sqr = dist_sqr
        shortest_path.append(Nodelist[next_node])
        Nodelist[next_node] = Nodelist[-1]
        Nodelist = Nodelist[:-1]
    return shortest_path