Fast Pathfinder associative network algorithm (PFNET) in Python

I've been trying to implement a "Fast Pathfinder" network pruning algorithm from https://doi.org/10.1016/j.ipm.2007.09.005 in Python/networkX, and have finally stumbled on something that is returning something that looks more or less right.

I'm not quite competent enough to test if the results are consistently (or ever, for that matter) correct though. Especially for directed graphs I have my doubts, and I'm unsure if the original is even intended to work for directed graphs. I have not found a Python implementation for any pathfinder network algorithms yet, but if there is an existing alternative to use I would also be interested for comparing results. I know there is an implementation in R (https://rdrr.io/cran/comato/src/R/pathfinder.r) where I took some inspiration as well.

Based on my best (read: poor) understanding, the algorithm described in the paper uses a distance matrix of shortest paths generated by the Floyd-Warshall algorithm, and compares those distances to the weighted adjacency matrix, picking only the matches as links. The intuition for the expected result in the undirected case is the union of all edges in all of its possible minimum spanning trees.

That is what I am attempting to emulate with the below function:

def minimal_pathfinder(G, r = float("inf")):
    """ 
    Args:
    -----
    G [networkX graph]:
        Graph to filter links from.
    r [float]:
        "r" parameter as in the paper.

    Returns:
    -----
    PFNET [networkX graph]:
        Graph containing only the PFNET links.
    """
    
    import networkx as nx
    from collections import defaultdict
    
    H = G.copy()
    
    # Initialize adjacency matrix W
    W = defaultdict(lambda: defaultdict(lambda: float("inf")))
    
    # Set diagonal to 0
    for u in H.nodes():
        W[u][u] = 0 
    
    # Get weights and set W values
    for i, j, d in H.edges(data=True):
        W[i][j] = d['weight'] # Add weights to W
        
    # Get shortest path distance matrix D
    dist = nx.floyd_warshall_predecessor_and_distance(H, weight='weight')[1]
    
    # Iterate over all triples to get values for D
    for k in H.nodes():
        for i in H.nodes():
            for j in H.nodes():
                if r == float("inf"): # adapted from the R-comato version which does a similar check
                # Discard non-shortest paths
                    dist[i][j] = min(dist[i][j], (dist[i][k] + dist[k][j]))
                else:
                    dist[i][j] = min(dist[i][j], (((dist[i][k]) ** r) + ((dist[k][j]) ** r )) ** (1/r))
                
    # Check for type; set placeholder for either case
    if not H.is_directed():
        PFNET = nx.Graph()
        PFNET.add_nodes_from(H.nodes(data=True))
    else:
        PFNET = nx.DiGraph()
        PFNET.add_nodes_from(H.nodes(data=True))
        
    # Add links D_ij only if == W_ij
    for i in H.nodes():
        for j in H.nodes():
            if dist[i][j] == W[i][j]: # If shortest path distance equals distance in adjacency
                if dist[i][j] == float("inf"): # Skip infinite path lengths
                    pass
                elif i == j: # Skip the diagonal
                    pass
                else: # Add link to PFNET
                    weight = dist[i][j]
                    PFNET.add_edge(i, j, weight=weight)
                    
    return PFNET

I've tested this with a bunch of real networks (both directed and undirected) and randomly generated networks, both cases ranging from 20ish nodes up to around 300 nodes, maximum few thousand edges (e.g. complete graphs, connected caveman graphs). In all cases it returns something, but I have little confidence the results are correct. As I find no other implementations I'm unsure how to verify this is working consistently (I'm not really using any other languages at all).

I am fairly sure there is still something wrong with this but I am unsure of what it might be.

Simple use case:

G = nx.complete_graph(50) # Generate a complete graph

# Add random weights
for (u,v,w) in G.edges(data=True):
    w['weight'] = np.random.randint(1,20)
    
PFNET = minimal_pathfinder(G)

print(nx.info(G))
print(nx.info(PFNET))

Output:

Graph with 50 nodes and 1225 edges
Graph with 50 nodes and 236 edges

I was wondering about two things:

1. Any idea what might be wrong with the implementation? Should I have confidence in the results?

Any idea how this might converted to work with similarity data instead of distances?

To the second I considered normalizing the weights to 0-1 range and converting all the distances to similarities by 1 - distance. But I am unsure if this is theoretically valid, and was hoping for a second opinion.

EDIT: I possibly discovered solution to Q2. in the original paper: change float("inf") to float("-inf") and change min to max in the first loop. From the authors' footnote:

Actually, using similarities or distances has no influence at all in our proposal. In case of using similarities, we would only need to replace MIN by MAX, ’>’ by ’<’, and use r = -inf to mimic the MIN function instead of the MAX function in the Fast Pathfinder algorithm.

Any inputs much appreciated, thanks!

EDIT (adding example of how it goes wrong from here) per comment, using the "example from a datafile" section:

Adjacency in starting graph:

matrix([[0, 1, 4, 2, 2],
        [1, 0, 2, 3, 0],
        [4, 2, 0, 3, 1],
        [2, 3, 3, 0, 3],
        [2, 0, 1, 3, 0]], dtype=int32)

And after pruning with the function, converting first into a networkX undirected graph:

matrix([[0, 1, 0, 2, 2],
        [1, 0, 2, 3, 0],
        [0, 2, 0, 3, 1],
        [2, 3, 3, 0, 3],
        [2, 0, 1, 3, 0]], dtype=int32)

It seems to drop only the highest weight overall leaving all other edges. Since the expected result is in an edgelist on the linked example, here's the edgelist of the result I obtain as well:

source  target  weight
1       2       1
1       4       2
1       5       2
2       3       2
2       4       3 
3       4       3
3       5       1
4       5       3

Solution

Below is a possible implementation of Fast-Pathfinder in Python using the networkx library. Note:

the implementation corresponds to the paper.
it is inspired from the C implementation found in GitHub.
only the maximum variant is implemented, where the input matrix is a similarity matrix and not a distance matrix (edges with the highest values are kept).

def fast_pfnet(G, q, r):
    
    s = G.number_of_nodes()
    weights_init = np.zeros((s,s))
    weights = np.zeros((s,s))
    hops = np.zeros((s,s))
    pfnet = np.zeros((s,s))

    for i, j, d in G.edges(data=True):
        weights_init[i,j] = d['weight']
        weights_init[j,i] = d['weight']

    for i in range(s):
        for j in range(s):
            weights[i,j] = -weights_init[i,j]
            if i==j:
                hops[i,j] = 0
            else:
                hops[i,j] = 1

    def update_weight_maximum(i, j, k, wik, wkj, weights, hops, p):
        if p<=q:
            if r==0:
                # r == infinity
                dist = max(wik, wkj)
            else:
                dist = (wik**r + wkj**r) ** (1/r)

            if dist < weights[i,j]:
                weights[i,j] = dist
                weights[j,i] = dist
                hops[i,j] = p
                hops[j,i] = p
                
    def is_equal(a, b):
        return abs(a-b)<0.00001

    for k in range(s):
        for i in range(s):
            if i!=k:
                beg = i+1
                for j in range(beg, s):
                    if j!=k:
                        update_weight_maximum(i, j, k, weights_init[i,k], weights_init[k,j], weights, hops, 2)
                        update_weight_maximum(i, j, k, weights[i,k], weights[k,j], weights, hops, hops[i,k]+hops[k,j])

    for i in range(s):
        for j in range(s): # Possible optimisation: in case of symmetrical matrices, we do not need to go from 0 to s but from i+1 to s
            if not is_equal(weights_init[i,j], 0):
                if is_equal(weights[i,j], -weights_init[i,j]):
                    pfnet[i,j] = weights_init[i,j]
                else:
                    pfnet[i,j] = 0

    return nx.from_numpy_matrix(pfnet)

Usage:

m = np.matrix([[0, 1, 4, 2, 2],
        [1, 0, 2, 3, 0],
        [4, 2, 0, 3, 1],
        [2, 3, 3, 0, 3],
        [2, 0, 1, 3, 0]], dtype=np.int32)

G = nx.from_numpy_matrix(m)

# Fast-PFNET parameters set to emulate MST-PFNET
# This variant is OK for other parameters (q, r) but for the ones below
# it is faster to implement the MST-PFNET variant instead.
q = G.number_of_nodes()-1
r = 0

P = fast_pfnet(G, q, r)

list(P.edges(data=True))

This should return:

[(0, 2, {'weight': 4.0}),
 (1, 3, {'weight': 3.0}),
 (2, 3, {'weight': 3.0}),
 (3, 4, {'weight': 3.0})]

Which is similar to what is shown on the website (see the example in the section "After the application of Pathfinder").