Search code examples
pythonmatplotlibscipynetworkxdendrogram

Plot the dendrogram of communities found by NetworkX Girvan-Newman algorithm


The Girvan-Newman algorithm for community detection in networks:

detects communities by progressively removing edges from the original graph. The algorithm removes the “most valuable” edge, traditionally the edge with the highest betweenness centrality, at each step. As the graph breaks down into pieces, the tightly knit community structure is exposed and the result can be depicted as a dendrogram.

In NetworkX the implementation returns an iterator over tuples of sets. First tuple is the first cut consisting of 2 communities, second tuple is the second cut consisting of 3 communities, etc., until the last tuple with n sets for n separate nodes (the leaves of the dendrogram).

import networkx as nx

G = nx.path_graph(10)
comp = nx.community.girvan_newman(G)
list(comp)

[({0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}), ({0, 1}, {2, 3, 4}, {5, 6, 7, 8, 9}), ({0, 1}, {2, 3, 4}, {5, 6}, {8, 9, 7}), ({0, 1}, {2}, {3, 4}, {5, 6}, {8, 9, 7}), ({0, 1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}), ({0}, {1}, {2}, {3, 4}, {5, 6}, {7}, {8, 9}), ({0}, {1}, {2}, {3}, {4}, {5, 6}, {7}, {8, 9}), ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8, 9}), ({0}, {1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9})]

Question is: how to plot this dendrogram?

I've looked at scipy.cluster.hierarchy.dendrogram but it expects a "linkage matrix" I'm guessing such as the one created by scipy.cluster.hierarchy.linkage, but I'm not sure how I would convert this list of tuples into this "linkage matrix".

So I'm asking how to draw this dendrogram, with/without the help of SciPy's dendrogram.


Solution

  • Following @ItamarMushkin I followed @mdml's answer with slight modifications and got what I wanted. At high level I'm turning NetworkX's Girvan-Newman iterator output into another DiGraph() I eventually want to see as a dendogram. Then I build Z, a linkage matrix I input to scipy.cluster.hierarchy.dendrogram, in the form of a edgelist that includes the actual height for each dendogram merge.

    Two modifications I had to make to @mdml's answer:

    1. Not that important: I sort the tuple-keys of nodes entering index
    2. More important: I add a get_merge_height function, which gives for each merge its unique height according to Girvan-Newman output order of edges removal. Otherwise, all merges of two nodes would be the same height in the dendrogram, all merges in the next level of merging two nodes and another one would be the same height, etc.

    I understand there may be some redundant iterations here, I haven't thought about optimization yet.

    import networkx as nx
    from itertools import chain, combinations
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import dendrogram
    
    # get simulated Graph() and Girvan-Newman communities list
    G = nx.path_graph(10)
    communities = list(nx.community.girvan_newman(G))
    
    # building initial dict of node_id to each possible subset:
    node_id = 0
    init_node2community_dict = {node_id: communities[0][0].union(communities[0][1])}
    for comm in communities:
        for subset in list(comm):
            if subset not in init_node2community_dict.values():
                node_id += 1
                init_node2community_dict[node_id] = subset
    
    # turning this dictionary to the desired format in @mdml's answer
    node_id_to_children = {e: [] for e in init_node2community_dict.keys()}
    for node_id1, node_id2 in combinations(init_node2community_dict.keys(), 2):
        for node_id_parent, group in init_node2community_dict.items():
            if len(init_node2community_dict[node_id1].intersection(init_node2community_dict[node_id2])) == 0 and group == init_node2community_dict[node_id1].union(init_node2community_dict[node_id2]):
                node_id_to_children[node_id_parent].append(node_id1)
                node_id_to_children[node_id_parent].append(node_id2)
    
    # also recording node_labels dict for the correct label for dendrogram leaves
    node_labels = dict()
    for node_id, group in init_node2community_dict.items():
        if len(group) == 1:
            node_labels[node_id] = list(group)[0]
        else:
            node_labels[node_id] = ''
    
    # also needing a subset to rank dict to later know within all k-length merges which came first
    subset_rank_dict = dict()
    rank = 0
    for e in communities[::-1]:
        for p in list(e):
            if tuple(p) not in subset_rank_dict:
                subset_rank_dict[tuple(sorted(p))] = rank
                rank += 1
    subset_rank_dict[tuple(sorted(chain.from_iterable(communities[-1])))] = rank
    
    # my function to get a merge height so that it is unique (probably not that efficient)
    def get_merge_height(sub):
        sub_tuple = tuple(sorted([node_labels[i] for i in sub]))
        n = len(sub_tuple)
        other_same_len_merges = {k: v for k, v in subset_rank_dict.items() if len(k) == n}
        min_rank, max_rank = min(other_same_len_merges.values()), max(other_same_len_merges.values())
        range = (max_rank-min_rank) if max_rank > min_rank else 1
        return float(len(sub)) + 0.8 * (subset_rank_dict[sub_tuple] - min_rank) / range
    
    # finally using @mdml's magic, slightly modified:
    G           = nx.DiGraph(node_id_to_children)
    nodes       = G.nodes()
    leaves      = set( n for n in nodes if G.out_degree(n) == 0 )
    inner_nodes = [ n for n in nodes if G.out_degree(n) > 0 ]
    
    # Compute the size of each subtree
    subtree = dict( (n, [n]) for n in leaves )
    for u in inner_nodes:
        children = set()
        node_list = list(node_id_to_children[u])
        while len(node_list) > 0:
            v = node_list.pop(0)
            children.add( v )
            node_list += node_id_to_children[v]
        subtree[u] = sorted(children & leaves)
    
    inner_nodes.sort(key=lambda n: len(subtree[n])) # <-- order inner nodes ascending by subtree size, root is last
    
    # Construct the linkage matrix
    leaves = sorted(leaves)
    index  = dict( (tuple([n]), i) for i, n in enumerate(leaves) )
    Z = []
    k = len(leaves)
    for i, n in enumerate(inner_nodes):
        children = node_id_to_children[n]
        x = children[0]
        for y in children[1:]:
            z = tuple(sorted(subtree[x] + subtree[y]))
            i, j = index[tuple(sorted(subtree[x]))], index[tuple(sorted(subtree[y]))]
            Z.append([i, j, get_merge_height(subtree[n]), len(z)]) # <-- float is required by the dendrogram function
            index[z] = k
            subtree[z] = list(z)
            x = z
            k += 1
    
    # dendrogram
    plt.figure()
    dendrogram(Z, labels=[node_labels[node_id] for node_id in leaves])
    plt.savefig('dendrogram.png')
    

    enter image description here