I've been trying to implement the "Fast Pathfinder" network pruning algorithm from https://doi.org/10.1016/j.ipm.2007.09.005 in Python/networkX, and have finally arrived at something that returns results that look more or less right.
I'm not quite competent enough to test whether the results are consistently (or ever, for that matter) correct, though. For directed graphs especially I have my doubts, and I'm unsure whether the original is even intended to work on directed graphs. I have not yet found a Python implementation of any pathfinder network algorithm, but if an existing alternative is available I would be interested in it for comparing results. I know there is an implementation in R (https://rdrr.io/cran/comato/src/R/pathfinder.r) from which I also took some inspiration.
Based on my best (read: poor) understanding, the algorithm described in the paper uses a distance matrix of shortest paths generated by the Floyd-Warshall algorithm and compares those distances to the weighted adjacency matrix, keeping only the matching entries as links. My intuition for the expected result in the undirected case is the union of the edges of all of the graph's possible minimum spanning trees.
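For reference, here is a compact sketch of that reading using networkx's built-in Floyd-Warshall, which gives ordinary additive shortest-path lengths (the helper name is just for illustration, not from the paper):

import networkx as nx

def naive_pathfinder_sketch(G):
    # Keep an edge only if its weight equals the additive shortest-path
    # distance between its endpoints (my reading of the comparison step).
    dist = nx.floyd_warshall(G, weight="weight")
    PFNET = nx.DiGraph() if G.is_directed() else nx.Graph()
    PFNET.add_nodes_from(G.nodes(data=True))
    for u, v, d in G.edges(data=True):
        if d["weight"] == dist[u][v]:
            PFNET.add_edge(u, v, weight=d["weight"])
    return PFNET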
That is the idea I am attempting to emulate with the function below:
def minimal_pathfinder(G, r=float("inf")):
    """
    Args:
    -----
    G [networkX graph]:
        Graph to filter links from.
    r [float]:
        "r" parameter as in the paper.

    Returns:
    -----
    PFNET [networkX graph]:
        Graph containing only the PFNET links.
    """
    import networkx as nx
    from collections import defaultdict

    H = G.copy()

    # Initialize adjacency matrix W
    W = defaultdict(lambda: defaultdict(lambda: float("inf")))

    # Set diagonal to 0
    for u in H.nodes():
        W[u][u] = 0

    # Get weights and set W values
    for i, j, d in H.edges(data=True):
        W[i][j] = d['weight']  # Add weights to W

    # Get shortest path distance matrix D
    dist = nx.floyd_warshall_predecessor_and_distance(H, weight='weight')[1]

    # Iterate over all triples to get values for D
    for k in H.nodes():
        for i in H.nodes():
            for j in H.nodes():
                if r == float("inf"):  # adapted from the R-comato version which does a similar check
                    # Discard non-shortest paths
                    dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
                else:
                    dist[i][j] = min(dist[i][j], (dist[i][k] ** r + dist[k][j] ** r) ** (1 / r))

    # Check for type; set placeholder for either case
    if not H.is_directed():
        PFNET = nx.Graph()
        PFNET.add_nodes_from(H.nodes(data=True))
    else:
        PFNET = nx.DiGraph()
        PFNET.add_nodes_from(H.nodes(data=True))

    # Add links D_ij only if == W_ij
    for i in H.nodes():
        for j in H.nodes():
            if dist[i][j] == W[i][j]:  # If shortest path distance equals distance in adjacency
                if dist[i][j] == float("inf"):  # Skip infinite path lengths
                    pass
                elif i == j:  # Skip the diagonal
                    pass
                else:  # Add link to PFNET
                    weight = dist[i][j]
                    PFNET.add_edge(i, j, weight=weight)

    return PFNET
I've tested this with a bunch of real networks (both directed and undirected) and with randomly generated networks, ranging from about 20 nodes up to around 300 nodes and at most a few thousand edges (e.g. complete graphs, connected caveman graphs). In all cases it returns something, but I have little confidence that the results are correct. As I can find no other implementations, I'm unsure how to verify that it works consistently (I'm not really using any other languages at all).
I am fairly sure there is still something wrong with this but I am unsure of what it might be.
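One way I could sanity-check small undirected cases is to test every edge directly against the Pathfinder criterion by enumerating simple paths. This is only feasible for tiny graphs, assumes r = infinity and q = n - 1, and the helper below is only a sketch:

import networkx as nx

def brute_force_pfnet_check(G, PFNET):
    # For every edge, compute the best alternative path length under the
    # r = infinity metric (the maximum edge weight along the path) and
    # compare against whether the edge survived in PFNET.
    ok = True
    for u, v, d in G.edges(data=True):
        w = d["weight"]
        best_alt = float("inf")
        for path in nx.all_simple_paths(G, u, v):
            if len(path) == 2:  # skip the direct edge itself
                continue
            path_len = max(G[a][b]["weight"] for a, b in zip(path, path[1:]))
            best_alt = min(best_alt, path_len)
        should_keep = w <= best_alt  # keep the edge if no strictly "shorter" path exists
        if should_keep != PFNET.has_edge(u, v):
            ok = False
            print(f"Mismatch on edge ({u}, {v}): weight={w}, best alternative={best_alt}")
    return ok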
Simple use case:
import networkx as nx
import numpy as np

G = nx.complete_graph(50)  # Generate a complete graph

# Add random weights
for (u, v, w) in G.edges(data=True):
    w['weight'] = np.random.randint(1, 20)

PFNET = minimal_pathfinder(G)

print(nx.info(G))
print(nx.info(PFNET))
Output:
Graph with 50 nodes and 1225 edges
Graph with 50 nodes and 236 edges
I was wondering about two things:
1. Any idea what might be wrong with the implementation? Should I have confidence in the results?
2. How would I go about using this when the weights are similarities rather than distances?
To the second, I considered normalizing the weights to the 0-1 range and converting the distances to similarities by 1 - distance. But I am unsure if this is theoretically valid, and was hoping for a second opinion.
EDIT: I may have found a solution to Q2 in the original paper: change float("inf") to float("-inf") and change min to max in the first loop. From the authors' footnote:
Actually, using similarities or distances has no influence at all in our proposal. In case of using similarities, we would only need to replace MIN by MAX, ’>’ by ’<’, and use r = -inf to mimic the MIN function instead of the MAX function in the Fast Pathfinder algorithm.
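If I read that correctly, the relaxation for similarities cannot simply reuse nx.floyd_warshall (which sums weights). A rough, untested sketch of the similarity variant (handling only the r = -inf case, with missing links initialised to -inf; the function name is mine):

import networkx as nx
from collections import defaultdict

def similarity_pathfinder_sketch(H):
    # Treat weights as similarities (higher = stronger link).
    W = defaultdict(lambda: defaultdict(lambda: float("-inf")))
    for i, j, d in H.edges(data=True):
        W[i][j] = d["weight"]
        if not H.is_directed():
            W[j][i] = d["weight"]
    # Relax with MAX over alternative paths and MIN along a path
    # (the "weakest link"), per the footnote's r = -inf prescription.
    sim = {i: {j: W[i][j] for j in H.nodes()} for i in H.nodes()}
    for k in H.nodes():
        for i in H.nodes():
            for j in H.nodes():
                sim[i][j] = max(sim[i][j], min(sim[i][k], sim[k][j]))
    PFNET = nx.DiGraph() if H.is_directed() else nx.Graph()
    PFNET.add_nodes_from(H.nodes(data=True))
    for i, j, d in H.edges(data=True):
        # Keep a link only if no alternative path has a larger weakest link
        if sim[i][j] == d["weight"]:
            PFNET.add_edge(i, j, weight=d["weight"])
    return PFNET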
Any inputs much appreciated, thanks!
EDIT (per a comment, adding an example of how it goes wrong), using the "example from a datafile" section from here:
Adjacency in starting graph:
matrix([[0, 1, 4, 2, 2],
[1, 0, 2, 3, 0],
[4, 2, 0, 3, 1],
[2, 3, 3, 0, 3],
[2, 0, 1, 3, 0]], dtype=int32)
And after first converting it into a networkX undirected graph and then pruning with the function:
matrix([[0, 1, 0, 2, 2],
[1, 0, 2, 3, 0],
[0, 2, 0, 3, 1],
[2, 3, 3, 0, 3],
[2, 0, 1, 3, 0]], dtype=int32)
It seems to drop only the single highest-weight edge, leaving all other edges. Since the expected result is given as an edgelist in the linked example, here's the edgelist of the result I obtain as well:
source target weight
1 2 1
1 4 2
1 5 2
2 3 2
2 4 3
3 4 3
3 5 1
4 5 3
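For completeness, the matrices above can be reproduced roughly like this (assuming the pre-3.0 networkx converters from_numpy_matrix / to_numpy_matrix, with minimal_pathfinder defined as above):

import numpy as np
import networkx as nx

m = np.matrix([[0, 1, 4, 2, 2],
               [1, 0, 2, 3, 0],
               [4, 2, 0, 3, 1],
               [2, 3, 3, 0, 3],
               [2, 0, 1, 3, 0]], dtype=np.int32)

G = nx.from_numpy_matrix(m)            # undirected graph, weights taken from the matrix
PFNET = minimal_pathfinder(G)
print(nx.to_numpy_matrix(PFNET, dtype=np.int32))  # adjacency after pruning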
Below is a possible implementation of Fast-Pathfinder in Python using the networkx library. Note that it assumes an undirected graph whose weights are similarities (higher = stronger link), that q is the maximum number of links allowed in an alternative path, and that r = 0 is used here to stand for r = infinity.
import numpy as np
import networkx as nx

def fast_pfnet(G, q, r):
    # q: maximum number of links (hops) allowed in an alternative path
    # r: Minkowski r-metric parameter; r == 0 is used here to mean r == infinity
    s = G.number_of_nodes()
    weights_init = np.zeros((s, s))
    weights = np.zeros((s, s))
    hops = np.zeros((s, s))
    pfnet = np.zeros((s, s))

    # Symmetric matrix of the original weights (similarities)
    for i, j, d in G.edges(data=True):
        weights_init[i, j] = d['weight']
        weights_init[j, i] = d['weight']

    # Work on negated weights so that the usual "minimise" update applies
    for i in range(s):
        for j in range(s):
            weights[i, j] = -weights_init[i, j]
            if i == j:
                hops[i, j] = 0
            else:
                hops[i, j] = 1

    def update_weight_maximum(i, j, k, wik, wkj, weights, hops, p):
        if p <= q:
            if r == 0:
                # r == infinity: the path length is the maximum weight along the path
                dist = max(wik, wkj)
            else:
                dist = (wik ** r + wkj ** r) ** (1 / r)
            if dist < weights[i, j]:
                weights[i, j] = dist
                weights[j, i] = dist
                hops[i, j] = p
                hops[j, i] = p

    def is_equal(a, b):
        return abs(a - b) < 0.00001

    for k in range(s):
        for i in range(s):
            if i != k:
                beg = i + 1
                for j in range(beg, s):
                    if j != k:
                        update_weight_maximum(i, j, k, weights_init[i, k], weights_init[k, j], weights, hops, 2)
                        update_weight_maximum(i, j, k, weights[i, k], weights[k, j], weights, hops, hops[i, k] + hops[k, j])

    # Keep a link only if no alternative path is "shorter" than the direct link
    for i in range(s):
        for j in range(s):  # Possible optimisation: for symmetrical matrices we only need j from i+1 to s
            if not is_equal(weights_init[i, j], 0):
                if is_equal(weights[i, j], -weights_init[i, j]):
                    pfnet[i, j] = weights_init[i, j]
                else:
                    pfnet[i, j] = 0

    return nx.from_numpy_matrix(pfnet)
Usage:
m = np.matrix([[0, 1, 4, 2, 2],
               [1, 0, 2, 3, 0],
               [4, 2, 0, 3, 1],
               [2, 3, 3, 0, 3],
               [2, 0, 1, 3, 0]], dtype=np.int32)

G = nx.from_numpy_matrix(m)

# Fast-PFNET parameters set to emulate MST-PFNET
# This variant is OK for other parameters (q, r) but for the ones below
# it is faster to implement the MST-PFNET variant instead.
q = G.number_of_nodes() - 1
r = 0

P = fast_pfnet(G, q, r)
list(P.edges(data=True))
This should return:
[(0, 2, {'weight': 4.0}),
(1, 3, {'weight': 3.0}),
(2, 3, {'weight': 3.0}),
(3, 4, {'weight': 3.0})]
Which is similar to what is shown on the website (see the example in the section "After the application of Pathfinder").
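In case it helps, the MST-PFNET special case mentioned in the comments (q = n - 1, r = infinity) can also be sketched directly as the union of all maximum spanning trees: an edge survives exactly when its endpoints cannot be connected using only strictly heavier edges. The function name and the connectivity-based test below are my own framing, and the weights are again treated as similarities:

import networkx as nx

def mst_pfnet_sketch(G):
    # Union of all maximum spanning trees: keep (u, v, w) iff u and v are
    # NOT connected by a path that uses only edges heavier than w.
    PFNET = nx.Graph()
    PFNET.add_nodes_from(G.nodes(data=True))
    for u, v, d in G.edges(data=True):
        w = d["weight"]
        heavier = nx.Graph()
        heavier.add_nodes_from(G.nodes())
        heavier.add_edges_from((a, b) for a, b, e in G.edges(data=True) if e["weight"] > w)
        if not nx.has_path(heavier, u, v):
            PFNET.add_edge(u, v, weight=w)
    return PFNET

On the example matrix above, this keeps the same four edges as fast_pfnet.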