Search code examples
pythonnetworkxsimilarity

Weighted Jaccard Similarity for Two Networks with Networkx


I am using Python networkx to compute the Jaccard similarity of the following 2 networks:

G.edges(data=True)
OutEdgeDataView([('v1', 'a', {'weight': 1}), ('v1', 'b', {'weight': 1}), ('a', 'd', {'weight': 1}), ('a', '11', {'weight': 7}), ('b', 'v4', {'weight': 1}), ('b', 'v9', {'weight': 1}), ('v9', 'b', {'weight': 1})])

K.edges(data=True)
OutEdgeDataView([('v1', 'asssssss', {'weight': 1}), ('v1', 'b', {'weight': 1}), ('asssssss', 'd', {'weight': 1}), ('asssssss', '11', {'weight': 7}), ('b', 'asssssss', {'weight': 10}), ('b', 'd', {'weight': 10}), ('v9', 'b', {'weight': 1})])

I am using the following solution kindly provided in this answer:

def jaccard_similarity(g, h):
    i = set(g).intersection(h)
    return round(len(i) / (len(g) + len(h) - len(i)),6)

jaccard_similarity(G.edges(), K.edges())
0.166667

However, I need to extend this calculation to a weighted jaccard measure (distance or similarity) that takes into account arc weights and works both for directed and undirected networks, which could have different edges and nodes.

Can you help me?


Solution

  • For the weighted Jaccard, we need to define a vector. If we treat union of all edges as our vector dimensions, then our weights for each graph form a coordinate in that space.

    def weighted_jaccard(g1, g2):
        edges = set(g1.edges()).union(g2.edges())
        mins, maxs = 0, 0
        for edge in edges:
            weight1 = g1.get_edge_data(*edge, {}).get('weight', 0)
            weight2 = g2.get_edge_data(*edge, {}).get('weight', 0)
            mins += min(weight1, weight2)
            maxs += max(weight1, weight2)
        return mins / maxs
    

    This definition is consistent with the non-weighted Jaccard when all of the weights are 1.

    It's worth pointing out that once you've defined your vectors in this way, you could apply pretty much any Rn distance metric.