Search code examples
javasetminimum-spanning-treekruskals-algorithm

Problem removing from Java-HashSet while implementing Kruskal's algorithm


in the following code I am trying to implement Kruskal's algorithm to compute the minimal spanning tree of a graph.

The problem is that removing sets from the connected components does not work properly and that the case if(!startSet.equals(endSet)) always seems to be executed.

Now the code:

    import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

import org.jgrapht.Graph;
import org.jgrapht.graph.AsSubgraph;
import org.jgrapht.graph.DefaultWeightedEdge;
import java.util.PriorityQueue;

public class mySpanningTree {
    
    //Receives a graph and computes a minimum spanning tree
    public static AsSubgraph<Integer, DefaultWeightedEdge> computeMST(Graph<Integer, DefaultWeightedEdge> graph) {
        AsSubgraph<Integer, DefaultWeightedEdge> tree = new AsSubgraph<Integer, DefaultWeightedEdge>(graph, graph.vertexSet(), new HashSet<DefaultWeightedEdge>());

        PriorityQueue<DefaultWeightedEdge> edgeQueue = new PriorityQueue<>(Comparator.comparingDouble(graph::getEdgeWeight));
        edgeQueue.addAll(graph.edgeSet());
        Set<Set<Integer>> connectedComponents = new HashSet<>();
        for(Integer i : graph.vertexSet()){
            Set<Integer> set = new HashSet<>();
            set.add(i);
            connectedComponents.add(set);
        }
        int n = tree.vertexSet().size() - 1;
        while (!edgeQueue.isEmpty() && tree.edgeSet().size() < n) {
            DefaultWeightedEdge edge = edgeQueue.poll();
            Integer start = graph.getEdgeSource(edge);
            Integer end = graph.getEdgeTarget(edge);
            Set<Integer> startSet =  null;
            Set<Integer> endSet = null;
            for(Set<Integer> set: connectedComponents){
                if(set.contains(start)){
                    startSet = set;
                }
                if(set.contains(end)){
                    endSet = set;
                }
            }
            if(!startSet.equals(endSet)){
                startSet.addAll(endSet);
                connectedComponents.remove(endSet);
                tree.addEdge(start, end, edge);
            }
            else{
            }
        }
        return tree;
    }

The idea of the code is to keep track of the connected components in a set. Whenever an edge connects two nodes from different connected components, I want to union the two connected components and store the result inside the set instead of the two components from before. Furthermore, the edge shall be added. Whenever an edge has two endpoints in one and the same set, it shall not be added (since adding it would create a cycle).

Any help is appreciated!


Solution

  • The value of connectedComponents is a mutable set, whose elements are also mutable sets. The problem is, that by destructively modifying the elements of connectedComponents without "notifying" the container, the container's data structures are silently corrupted.

    Hash sets are managed data structures, that use an integer value (the hashCode) derived from the contents of their elements to guide where they put the object internally in their private data structures. This mechanism gives fast insertion and retrieval (usually in the order of O(1)). However, for this to work, the hashCode must not change while the object is present in the container.

    In your code, the container is a HashSet<HashSet<Integer>> and the elements are HashSet<Integer> instances. For Java's sets, the hash code is computed based on the what the set currently contains. So, when you do

    Set<Integer> set = new HashSet<>();
    set.add(i);
    connectedComponents.add(set);
    

    the connectedComponents container computes a hash code from set and uses that to determine a place to put the object into in its internal data structures. However, when you do

    startSet.addAll(endSet);
    

    you mutate the contents of startSet which in turn causes its hash code to change. From connectedComponent's perspective, the object is now "in the wrong place" in its internal data structures, and hence it is likely, that it will not be found again. The important thing here is, that data structure like HashSet are based on equals and hashCode, and for many data structures (including HashSet) those are based on the actual contents, not merely on object identity.

    To address this problem, try removing the element set from connectedComponents before mutating it, and then add it back afterward:

    if(!startSet.equals(endSet)){
        connectedComponents.remove(startSet);  // Added back below ...
        connectedComponents.remove(endSet);
        startSet.addAll(endSet);
        connectedComponents.add(startSet);     // ... right here
        tree.addEdge(start, end, edge);
    }
    

    Alternatively, it might help to keep track of the element sets using a different container, like an IdentityHashMap (use whatever you like as the value). This would work here, since you do not need the container to track the element sets by their content (as HashSet implicitly does).

    For example:

    IdentityHashMap<Set<Integer>, Object> connectedComponents = new IdentityHashMap<>();
    
    ...
    
    for(Integer i : graph.vertexSet()){
        Set<Integer> set = new HashSet<>();
        set.add(i);
        connectedComponents.put(set, set);
    }
    
    ...
    
    for(Set<Integer> set: connectedComponents.keySet()) {
        ...
        if(set.contains(start)){
            startSet = set;
        }
        if(set.contains(end)){
            endSet = set;
        }
    }
    
    if(!startSet.equals(endSet)) {
        connectedComponents.remove(endSet);
        startSet.addAll(endSet);
        tree.addEdge(start, end, edge);
    }
    

    I am not sure right now, whether it should remain !startSet.equals(endSet) or whether you might want to use startSet != endSet with the modified code, though. The former version uses "deep equality" for the test (i.e., considers the actual contents the the sets involved) whereas the latter version works by object identity only, which I think would be more appropriate if you consider the IdentityHashMap route.