Search code examples
pythongraph-algorithmdijkstra

Redundant Checks in Python Implementation of Dijkstra's Algorithm


EDIT: I've added some output to highlight what I believe the problem is.

There are so many versions of Dijkstra's Algorithm out there, and when you are learning it is hard to assess their quality.

The implementation below appears to be from a reputable source (https://bradfieldcs.com/algos/graphs/dijkstras-algorithm/)

However, it seems that since this version doesn't keep track of visited nodes, these lines:

for neighbor, weight in graph[current_vertex].items():
            distance = current_distance + weight

potentially create a lot of unnecessary checking.

The output from the code below is:

Neighbor: V, weight: 6, distance: 6
Neighbor: W, weight: 7, distance: 7
Neighbor: U, weight: 6, distance: 12
Neighbor: X, weight: 10, distance: 16
Neighbor: U, weight: 7, distance: 14
Neighbor: X, weight: 1, distance: 8
Neighbor: W, weight: 1, distance: 9
Neighbor: V, weight: 10, distance: 18
{'U': 0, 'V': 6, 'W': 7, 'X': 8}

To me this suggests that the algorithm is doing unnecessary work, as node U becomes a neighbor multiple times, its distance is calculated as twice the distance already calculated, and therefore it is rejected. My understanding is that once a node is processed, it no longer needs to be considered. I may be misunderstanding the algorithm, but this looks suspicious to me.

Since keeping track of visited nodes seems integral to the definition of Dijkstra's Algorithm, is it fair to say that this particular implementation is "not great"? Or am I missing something?

It would be great to see a "best practices" version of Dijkstra's Algorithm in Python, preferably using the same kind of structure for the graph.

import heapq


def calculate_distances(graph, starting_vertex):
    distances = {vertex: float('infinity') for vertex in graph}
    distances[starting_vertex] = 0

    pq = [(0, starting_vertex)]
    while len(pq) > 0:
        current_distance, current_vertex = heapq.heappop(pq)

        # Nodes can get added to the priority queue multiple times. We only
        # process a vertex the first time we remove it from the priority queue.
        if current_distance > distances[current_vertex]:
            continue

        for neighbor, weight in graph[current_vertex].items():
            distance = current_distance + weight
            print(f"Neighbor: {neighbor}, weight: {weight}, distance: {distance}")

            # Only consider this new path if it's better than any path we've
            # already found.
            if distance < distances[neighbor]:
                distances[neighbor] = distance
                heapq.heappush(pq, (distance, neighbor))

    return distances


example_graph = {
    'U': {'V': 6, 'W': 7},
    'V': {'U': 6, 'X': 10},
    'W': {'U': 7, 'X': 1},
    'X': {'W': 1, 'V': 10}
}
print(calculate_distances(example_graph, 'U'))

Solution

  • My understanding is that once a node is processed, it no longer needs to be considered.

    If you mean with "considered" that its distance along the path is calculated, then this is true, but also consider that comparing a distance with the best value so far is not significantly more complex than checking whether a neighbor was already visited. In either case (algorithm), a truly visited node (i.e. a node that has been popped from the heap) will never be pushed unto the heap again.

    Let's look at a variant of the algorithm where (only) the concept of "visited" is used to determine whether a neighbor should be put on the heap. I intentionally have tried to limit code changes, so the differences can be highlighted better:

    INF = float('infinity')
    def calculate_distances_2(graph, starting_vertex):
        distances = {vertex: INF for vertex in graph}
        pq = [(0, starting_vertex)]
        while len(pq) > 0:
            current_distance, current_vertex = heapq.heappop(pq)
            if distances[current_vertex] != INF:  # Already visited?
                continue
            distances[current_vertex] = current_distance
            for neighbor, weight in graph[current_vertex].items():
                print(f"Neighbor: {neighbor}, weight: {weight}, goes on heap? {distances[neighbor] == INF}")
                if distances[neighbor] == INF:  # Not yet visited?
                    heapq.heappush(pq, (current_distance + weight, neighbor))
        return distances
    

    So what is different here?

    • The distance of a node is only set when the node is popped of the heap, and this also serves for marking a node as visited: it no longer has Infinity as associated distance. This means that :

      • we don't set distances[starting_vertex] = 0 before the loop starts.
      • we only check whether a neighbor has been visited (implicitly, by checking distances[starting_vertex] is Infinity or not), but don't compare whether the current neighbor's distance is an improvement. This is entirely left to the heap mechanics now
    • A neighbor's distance along the current path does not have to be calculated when the node was already visited.

    The first point practically means that the second algorithm may push a node on the heap (again), while the first algorithm might not. In the worst case there is no difference, but in random cases we can expect such a difference to occur. This is because the first algorithm uses more information: when the same node is already present one or more times on the heap, the first algorithm knows the shortest distance among the traversed paths to that node, while the second algorithm "only" knows that this node has not yet been visited (i.e. has not yet been popped).

    Concrete example

    For your example there is no difference. I tried with this graph:

    enter image description here

    ...and used the code below to make the comparison. Note that I changed your print call: I removed the output of distance (as in the second algorithm it is not yet calculated), and added one more information: whether the neighbor will be pushed on the heap or not (False/True):

    import heapq
    
    INF = float('infinity')
    
    def calculate_distances(graph, starting_vertex):
        distances = {vertex: INF for vertex in graph}
        distances[starting_vertex] = 0
    
        pq = [(0, starting_vertex)]
        while len(pq) > 0:
            current_distance, current_vertex = heapq.heappop(pq)
            if current_distance > distances[current_vertex]:
                continue
            for neighbor, weight in graph[current_vertex].items():
                distance = current_distance + weight
                print(f"Neighbor: {neighbor}, weight: {weight}, goes on heap?: {distance < distances[neighbor]}")
                if distance < distances[neighbor]:
                    distances[neighbor] = distance
                    heapq.heappush(pq, (distance, neighbor))
        return distances
    
    
    ### Alternative
    
    def calculate_distances_2(graph, starting_vertex):
        distances = {vertex: INF for vertex in graph}
        pq = [(0, starting_vertex)]
        while len(pq) > 0:
            current_distance, current_vertex = heapq.heappop(pq)
            if distances[current_vertex] != INF:
                continue
            distances[current_vertex] = current_distance
            for neighbor, weight in graph[current_vertex].items():
                print(f"Neighbor: {neighbor}, weight: {weight}, goes on heap? {distances[neighbor] == INF}")
                if distances[neighbor] == INF:
                    heapq.heappush(pq, (current_distance + weight, neighbor))
        return distances
    
    
    example_graph = {
        "0": { "1": 2, "2": 6 },
        "1": { "0": 2, "3": 5 },
        "2": { "0": 6, "3": 8 },
        "3": { "1": 5, "2": 8, "4": 10, "5": 15 },
        "4": { "3": 10, "5": 6, "6": 2 },
        "5": { "3": 15, "4": 6, "6": 6 },
        "6": { "4": 2, "5": 6 }
    }
    
    print(calculate_distances(example_graph, '0'))
    print(calculate_distances_2(example_graph, '0'))
    

    I provide here the output that is generated by the first algorithm only, and mark the lines where the second algorithm has a different output:

    Neighbor: 1, weight: 2, goes on heap?: True
    Neighbor: 2, weight: 6, goes on heap?: True
    Neighbor: 0, weight: 2, goes on heap?: False
    Neighbor: 3, weight: 5, goes on heap?: True
    Neighbor: 0, weight: 6, goes on heap?: False
    Neighbor: 3, weight: 8, goes on heap?: False ****
    Neighbor: 1, weight: 5, goes on heap?: False
    Neighbor: 2, weight: 8, goes on heap?: False
    Neighbor: 4, weight: 10, goes on heap?: True
    Neighbor: 5, weight: 15, goes on heap?: True
    Neighbor: 3, weight: 10, goes on heap?: False
    Neighbor: 5, weight: 6, goes on heap?: False ****
    Neighbor: 6, weight: 2, goes on heap?: True
    Neighbor: 4, weight: 2, goes on heap?: False
    Neighbor: 5, weight: 6, goes on heap?: False ****
    Neighbor: 3, weight: 15, goes on heap?: False
    Neighbor: 4, weight: 6, goes on heap?: False
    Neighbor: 6, weight: 6, goes on heap?: False
    {'0': 0, '1': 2, '2': 6, '3': 7, '4': 17, '5': 22, '6': 19}
    

    The places where the output is different (3 places) indicate where the first algorithm outputs False and the second True.

    Conclusion

    Attribute First algorithm Second algorithm
    Heap size Better Worse
    Additions Worse Better

    The heap size will in random cases be more determining for execution times, and so the the first algorithm is expected to run slightly faster.