I need to calculate the adjacency matrix (in flat format) of a very large graph. Number nodes is 54327 and number edges is 46 million. The input are 46 million edges, so input looks like
1 6
2 7
1 6
3 8
...
meaning node 1 connects to 6, node 2 to 7, with possible repeats. The adjacency matrix in this case would look like
1 2 3 6 7 8
1 0 0 0 2 0 0
2 0 0 0 0 1 0
3 0 0 0 0 0 1
6 1 0 0 0 0 0
7 0 2 0 0 0 0
8 0 0 1 0 0 0
And in fact i just need the flat version for non-zero entries so:
node_x node_y count_edges
1 6 2
2 7 1
3 8 1
6 1 2
7 2 1
8 3 1
With networkx (or with a crosstab) the python code crashes with out of memory (on both a 8GB RAM personal machine or a 64GB RAM linux server).
Doing it the slow way, simply with for loops works, but takes 24+ hours to run.
Here is the full test code below, any ideas on how to get networkx (or crosstab or any clever alternative to run)?
import random
import pandas as pd
import networkx as nx
import sys
number_nodes = 54327 # real one is 54,327
number_edges_for_all_nodes = 46000000 # real one is around 46 million
print("number_nodes=",number_nodes)
print("number_edges_for_all_nodes=",number_edges_for_all_nodes)
# Create a list of possible list of nodes
list_nodes = [f'{i}' for i in range(1, number_nodes)]
# Create random combinations of possible values.
nodeid_x = random.choices(list_nodes, k=number_edges_for_all_nodes)
nodeid_y = random.choices(list_nodes, k=number_edges_for_all_nodes)
# This would look like this in a dataframe.
df = pd.DataFrame({'nodeid_x': nodeid_x, 'nodeid_y': nodeid_y})
# remove self nodes
df = df.query('nodeid_x != nodeid_y')
# df
print("Graph in df ->")
print(df.head(10))
print("Create edges for MultiDiGraph")
edges = df[['nodeid_x', 'nodeid_y']].to_numpy(dtype=int)
#
print("Create nodes for MultiDiGraph")
nodes = sorted(set(node for edge in edges for node in edge))
# create graph that allows parallel connections, so that one node may be connected twice to another
# Multi means parallel connections, Di means directional, since we want both directions
G = nx.MultiDiGraph()
print("Create add_nodes_from for MultiDiGraph")
G.add_nodes_from(nodes)
print("Create add_edges_from for MultiDiGraph")
G.add_edges_from(edges)
# memory usage
edge = sum([sys.getsizeof(e) for e in G.edges])
node = sum([sys.getsizeof(n) for n in G.nodes])
print("Total memory graph GB:", (edge + node)/1024/1024/1024)
# count connections, store to int to save space
print("Create to_pandas_adjacency for MultiDiGraph")
df = nx.to_pandas_adjacency(G, dtype=int)
print("Total memory adjacency GB:", df.memory_usage(deep=True).sum()/1024/1024/1024)
print(df)
# Reset the index to flatten the DataFrame from a matrix to a long table
print("Reset the index to flatten the DataFrame from a matrix to a long table")
df = df.stack().reset_index()
df.index = df.index.astype('int32', copy = False)
# rename the column/row edges
print("rename the column/row edges")
df.rename(columns={'level_0': 'nodeid_x'}, inplace=True)
df.rename(columns={'level_1': 'nodeid_y'}, inplace=True)
# rename crosstab intersection column
print("rename crosstab intersection column")
df.rename(columns={0: 'count_intersections'}, inplace=True)
# change count_intersections to int to save memory
print("change count_intersections to int to save memory")
df['count_intersections'] = df['count_intersections'].astype('int32', copy = False)
# now drop rows for which the nodeid_x = nodeid_y
print("now drop rows for which the nodeid_x = nodeid_y")
df = df.query('nodeid_x != nodeid_y')
# now drop zero intersections rows
print("now drop zero intersections rows")
df = df.query('count_intersections != 0')
# sort
df.sort_values(by=['count_intersections', 'nodeid_x', 'nodeid_y'], ascending=False, inplace = True)
print("len(df)=",len(df))
print("pandas_adjacency matrix flat->")
print(df.head(9))
Networkx, Crosstab, slow for loops
Maybe I miss something, buy you can use .groupby
+ .sum
here:
out = (
df.assign(count_intersections=1)
.groupby(["nodeid_x", "nodeid_y"], as_index=False)["count_intersections"]
.sum()
)
out.sort_values(
by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
)
print(out.head(10))
Prints (running with random.seed(42)
, with 46mio edges under one minute):
nodeid_x nodeid_y count_intersections
41523794 5587 29315 4
40851660 53761 14855 4
40587824 53478 42683 4
40429793 53307 75 4
36763731 4938 42230 4
30719061 4290 32316 4
6800705 17279 42566 4
45616995 9972 39801 3
45613051 9968 5797 3
45597210 9951 10498 3