c++minimum-spanning-tree kruskals-algorithm union-find

How to use union-find, minheap, Kruskal's, and a sort algorithm to create a minimum cost spanning tree? (C++)

I apologize if this question is a bit broad, but I'm having a difficult time trying to understand how I would create a minimum cost spanning tree. This is in C++ if it matters at all.

From what I understand, you would use Kruskal's to select the minimum cost edges for building the spanning tree. My thinking is to read the edges into a minheap and that way you can remove from the top in order to get the edge with the minimum cost.

So far I've only been able to implement the minheap and sets for union-find, I am still unsure of the purpose of union-find and a sorting algorithm for the purpose of creating a spanning tree.

I would greatly appreciate any advice.

EDIT: I am not limited to union find, minheap, kruskals, and a sorting algorith, nor am I required to do any. These were just the items suggested by the instructor.

Solution

These two structures serve different purposes in the algorithm. Kruskal's algorithm works by adding the cheapest possible edge at each point that doesn't form a cycle. It can be shown using some not particularly complex math that this guarantees that the resulting spanning tree is minimal. The intuition behind why this works is as follows. Suppose that Kruskal's algorithm is not optimal and that there is a cheaper spanning tree. Sort all of the edges in that tree by weight, then compare those edges in sorted order to the edges chosen by Kruskal's algorithm in sorted order. Since we assume for contradiction that Kruskal's algorithm isn't optimal, there must be some place in the sequences where there's a disagreement. If in this disagreement Kruskal's algorithm has a lighter edge than the optimal solution, then we can make the optimal solution even better by adding that edge in, finding the cycle it creates, then deleting the heaviest edge in the cycle. That edge can't be the edge we just added, because otherwise that would have created a cycle in the MST produced by Kruskal's algorithm and Kruskal's algorithm never adds an edge that creates a cycle. So this means that Kruskal's algorithm must have diverged from the optimal solution by not adding some light edge. But the only reason Kruskal's algorithm skips an edge is if it creates a cycle, and this means that there must be a cycle in the optimal MST, also a contradiction. This means that our assumption was wrong and that Kruskal's algorithm must be optimal.

Hopefully, this motivates why Kruskal's algorithm needs the heap and the union-find structure. We need the heap so that we can get back all the edges in sorted order. If we don't visit the edges in this order, then the above proof breaks down and all bets are off. Interestingly, you don't actually need a heap; you just need some way of visiting all the edges in sorted order. If you want, you can just dump all the edges into a giant array and then sort the array. This doesn't change the runtime of the algorithm from the binary heap case if you use a fast sort.

The union-find structure is a bit trickier. At each point in Kruskal's algorithm you need to be able to tell whether adding an edge would create a cycle in the graph. One way to do this is to store a structure that keeps track of what nodes are already connected to one another. That way, when adding an edge, you can check whether the endpoints are already connected. If they are, then the edge would form a cycle and should be ignored. The union-find structure is a way of maintaining this information efficiently. In particular, its two operations - union and find - correspond to the act of connecting together two distinct groups of nodes that were previously not connected, as would be the case if you added an edge that connected two trees contained in different parts of the spanning forest. The find step gives you a way to check if two nodes are already connected; if so you should skip the current edge.

Hope this helps!