EDIT: I've added some output to highlight what I believe the problem is.
There are so many versions of Dijkstra's Algorithm out there, and when you are learning it is hard to assess their quality.
The implementation below appears to be from a reputable source (https://bradfieldcs.com/algos/graphs/dijkstras-algorithm/)
However, it seems that since this version doesn't keep track of visited nodes, these lines:
for neighbor, weight in graph[current_vertex].items():
distance = current_distance + weight
potentially create a lot of unnecessary checking.
The output from the code below is:
Neighbor: V, weight: 6, distance: 6
Neighbor: W, weight: 7, distance: 7
Neighbor: U, weight: 6, distance: 12
Neighbor: X, weight: 10, distance: 16
Neighbor: U, weight: 7, distance: 14
Neighbor: X, weight: 1, distance: 8
Neighbor: W, weight: 1, distance: 9
Neighbor: V, weight: 10, distance: 18
{'U': 0, 'V': 6, 'W': 7, 'X': 8}
To me this suggests that the algorithm is doing unnecessary work, as node U
becomes a neighbor multiple times, its distance is calculated as twice the distance already calculated, and therefore it is rejected. My understanding is that once a node is processed, it no longer needs to be considered. I may be misunderstanding the algorithm, but this looks suspicious to me.
Since keeping track of visited nodes seems integral to the definition of Dijkstra's Algorithm, is it fair to say that this particular implementation is "not great"? Or am I missing something?
It would be great to see a "best practices" version of Dijkstra's Algorithm in Python, preferably using the same kind of structure for the graph.
import heapq
def calculate_distances(graph, starting_vertex):
distances = {vertex: float('infinity') for vertex in graph}
distances[starting_vertex] = 0
pq = [(0, starting_vertex)]
while len(pq) > 0:
current_distance, current_vertex = heapq.heappop(pq)
# Nodes can get added to the priority queue multiple times. We only
# process a vertex the first time we remove it from the priority queue.
if current_distance > distances[current_vertex]:
continue
for neighbor, weight in graph[current_vertex].items():
distance = current_distance + weight
print(f"Neighbor: {neighbor}, weight: {weight}, distance: {distance}")
# Only consider this new path if it's better than any path we've
# already found.
if distance < distances[neighbor]:
distances[neighbor] = distance
heapq.heappush(pq, (distance, neighbor))
return distances
example_graph = {
'U': {'V': 6, 'W': 7},
'V': {'U': 6, 'X': 10},
'W': {'U': 7, 'X': 1},
'X': {'W': 1, 'V': 10}
}
print(calculate_distances(example_graph, 'U'))
My understanding is that once a node is processed, it no longer needs to be considered.
If you mean with "considered" that its distance along the path is calculated, then this is true, but also consider that comparing a distance with the best value so far is not significantly more complex than checking whether a neighbor was already visited. In either case (algorithm), a truly visited node (i.e. a node that has been popped from the heap) will never be pushed unto the heap again.
Let's look at a variant of the algorithm where (only) the concept of "visited" is used to determine whether a neighbor should be put on the heap. I intentionally have tried to limit code changes, so the differences can be highlighted better:
INF = float('infinity')
def calculate_distances_2(graph, starting_vertex):
distances = {vertex: INF for vertex in graph}
pq = [(0, starting_vertex)]
while len(pq) > 0:
current_distance, current_vertex = heapq.heappop(pq)
if distances[current_vertex] != INF: # Already visited?
continue
distances[current_vertex] = current_distance
for neighbor, weight in graph[current_vertex].items():
print(f"Neighbor: {neighbor}, weight: {weight}, goes on heap? {distances[neighbor] == INF}")
if distances[neighbor] == INF: # Not yet visited?
heapq.heappush(pq, (current_distance + weight, neighbor))
return distances
So what is different here?
The distance of a node is only set when the node is popped of the heap, and this also serves for marking a node as visited: it no longer has Infinity as associated distance. This means that :
distances[starting_vertex] = 0
before the loop starts.distances[starting_vertex]
is Infinity or not), but don't compare whether the current neighbor's distance is an improvement. This is entirely left to the heap mechanics nowA neighbor's distance along the current path does not have to be calculated when the node was already visited.
The first point practically means that the second algorithm may push a node on the heap (again), while the first algorithm might not. In the worst case there is no difference, but in random cases we can expect such a difference to occur. This is because the first algorithm uses more information: when the same node is already present one or more times on the heap, the first algorithm knows the shortest distance among the traversed paths to that node, while the second algorithm "only" knows that this node has not yet been visited (i.e. has not yet been popped).
For your example there is no difference. I tried with this graph:
...and used the code below to make the comparison. Note that I changed your print
call: I removed the output of distance
(as in the second algorithm it is not yet calculated), and added one more information: whether the neighbor will be pushed on the heap or not (False/True):
import heapq
INF = float('infinity')
def calculate_distances(graph, starting_vertex):
distances = {vertex: INF for vertex in graph}
distances[starting_vertex] = 0
pq = [(0, starting_vertex)]
while len(pq) > 0:
current_distance, current_vertex = heapq.heappop(pq)
if current_distance > distances[current_vertex]:
continue
for neighbor, weight in graph[current_vertex].items():
distance = current_distance + weight
print(f"Neighbor: {neighbor}, weight: {weight}, goes on heap?: {distance < distances[neighbor]}")
if distance < distances[neighbor]:
distances[neighbor] = distance
heapq.heappush(pq, (distance, neighbor))
return distances
### Alternative
def calculate_distances_2(graph, starting_vertex):
distances = {vertex: INF for vertex in graph}
pq = [(0, starting_vertex)]
while len(pq) > 0:
current_distance, current_vertex = heapq.heappop(pq)
if distances[current_vertex] != INF:
continue
distances[current_vertex] = current_distance
for neighbor, weight in graph[current_vertex].items():
print(f"Neighbor: {neighbor}, weight: {weight}, goes on heap? {distances[neighbor] == INF}")
if distances[neighbor] == INF:
heapq.heappush(pq, (current_distance + weight, neighbor))
return distances
example_graph = {
"0": { "1": 2, "2": 6 },
"1": { "0": 2, "3": 5 },
"2": { "0": 6, "3": 8 },
"3": { "1": 5, "2": 8, "4": 10, "5": 15 },
"4": { "3": 10, "5": 6, "6": 2 },
"5": { "3": 15, "4": 6, "6": 6 },
"6": { "4": 2, "5": 6 }
}
print(calculate_distances(example_graph, '0'))
print(calculate_distances_2(example_graph, '0'))
I provide here the output that is generated by the first algorithm only, and mark the lines where the second algorithm has a different output:
Neighbor: 1, weight: 2, goes on heap?: True
Neighbor: 2, weight: 6, goes on heap?: True
Neighbor: 0, weight: 2, goes on heap?: False
Neighbor: 3, weight: 5, goes on heap?: True
Neighbor: 0, weight: 6, goes on heap?: False
Neighbor: 3, weight: 8, goes on heap?: False ****
Neighbor: 1, weight: 5, goes on heap?: False
Neighbor: 2, weight: 8, goes on heap?: False
Neighbor: 4, weight: 10, goes on heap?: True
Neighbor: 5, weight: 15, goes on heap?: True
Neighbor: 3, weight: 10, goes on heap?: False
Neighbor: 5, weight: 6, goes on heap?: False ****
Neighbor: 6, weight: 2, goes on heap?: True
Neighbor: 4, weight: 2, goes on heap?: False
Neighbor: 5, weight: 6, goes on heap?: False ****
Neighbor: 3, weight: 15, goes on heap?: False
Neighbor: 4, weight: 6, goes on heap?: False
Neighbor: 6, weight: 6, goes on heap?: False
{'0': 0, '1': 2, '2': 6, '3': 7, '4': 17, '5': 22, '6': 19}
The places where the output is different (3 places) indicate where the first algorithm outputs False
and the second True
.
Attribute | First algorithm | Second algorithm |
---|---|---|
Heap size | Better | Worse |
Additions | Worse | Better |
The heap size will in random cases be more determining for execution times, and so the the first algorithm is expected to run slightly faster.