Search code examples
pythonpivot-tablenetworkxadjacency-matrixadjacency-list

Python code for calculation of very large adjacency matrix crashes using networkx MultiDiGraph


I need to calculate the adjacency matrix (in flat format) of a very large graph. Number nodes is 54327 and number edges is 46 million. The input are 46 million edges, so input looks like

1 6
2 7
1 6
3 8
...

meaning node 1 connects to 6, node 2 to 7, with possible repeats. The adjacency matrix in this case would look like

  1 2 3 6 7 8
1 0 0 0 2 0 0
2 0 0 0 0 1 0
3 0 0 0 0 0 1
6 1 0 0 0 0 0
7 0 2 0 0 0 0
8 0 0 1 0 0 0

And in fact i just need the flat version for non-zero entries so:

node_x node_y count_edges
1       6       2
2       7       1
3       8       1
6       1       2
7       2       1
8       3       1

With networkx (or with a crosstab) the python code crashes with out of memory (on both a 8GB RAM personal machine or a 64GB RAM linux server).

Doing it the slow way, simply with for loops works, but takes 24+ hours to run.

Here is the full test code below, any ideas on how to get networkx (or crosstab or any clever alternative to run)?

import random
import pandas as pd
import networkx as nx
import sys

number_nodes = 54327 # real one is 54,327
number_edges_for_all_nodes = 46000000 # real one is around 46 million

print("number_nodes=",number_nodes)
print("number_edges_for_all_nodes=",number_edges_for_all_nodes)

# Create a list of possible list of nodes
list_nodes = [f'{i}' for i in range(1, number_nodes)]

# Create random combinations of possible values.
nodeid_x = random.choices(list_nodes, k=number_edges_for_all_nodes)
nodeid_y = random.choices(list_nodes, k=number_edges_for_all_nodes)

# This would look like this in a dataframe.
df = pd.DataFrame({'nodeid_x': nodeid_x, 'nodeid_y': nodeid_y})

# remove self nodes
df = df.query('nodeid_x != nodeid_y')

# df
print("Graph in df ->")
print(df.head(10))

print("Create edges for MultiDiGraph")            
edges = df[['nodeid_x', 'nodeid_y']].to_numpy(dtype=int)  
# 
print("Create nodes for MultiDiGraph")            
nodes = sorted(set(node for edge in edges for node in edge))
# create graph that allows parallel connections, so that one node may be connected twice to another       
# Multi means parallel connections, Di means directional, since we want both directions
G = nx.MultiDiGraph()
print("Create add_nodes_from for MultiDiGraph")            
G.add_nodes_from(nodes)
print("Create add_edges_from for MultiDiGraph")            
G.add_edges_from(edges)
# memory usage
edge = sum([sys.getsizeof(e) for e in G.edges])
node = sum([sys.getsizeof(n) for n in G.nodes])
print("Total memory graph GB:", (edge + node)/1024/1024/1024)

# count connections, store to int to save space            
print("Create to_pandas_adjacency for MultiDiGraph")            
df = nx.to_pandas_adjacency(G, dtype=int)  
print("Total memory adjacency GB:", df.memory_usage(deep=True).sum()/1024/1024/1024)
print(df)

# Reset the index to flatten the DataFrame from a matrix to a long table
print("Reset the index to flatten the DataFrame from a matrix to a long table")            
df = df.stack().reset_index() 
df.index = df.index.astype('int32', copy = False) 

# rename the column/row edges
print("rename the column/row edges")            
df.rename(columns={'level_0': 'nodeid_x'}, inplace=True)
df.rename(columns={'level_1': 'nodeid_y'}, inplace=True)  
                        
# rename crosstab intersection column
print("rename crosstab intersection column")
df.rename(columns={0: 'count_intersections'}, inplace=True)

# change count_intersections to int to save memory
print("change count_intersections to int to save memory")
df['count_intersections'] = df['count_intersections'].astype('int32', copy = False)
        
# now drop rows for which the nodeid_x = nodeid_y
print("now drop rows for which the nodeid_x = nodeid_y")
df = df.query('nodeid_x != nodeid_y')

# now drop zero intersections rows
print("now drop zero intersections rows")
df = df.query('count_intersections != 0')  

# sort
df.sort_values(by=['count_intersections', 'nodeid_x', 'nodeid_y'], ascending=False, inplace = True)

print("len(df)=",len(df))

print("pandas_adjacency matrix flat->")
print(df.head(9))

Networkx, Crosstab, slow for loops


Solution

  • Maybe I miss something, buy you can use .groupby + .sum here:

    out = (
        df.assign(count_intersections=1)
        .groupby(["nodeid_x", "nodeid_y"], as_index=False)["count_intersections"]
        .sum()
    )
    out.sort_values(
        by=["count_intersections", "nodeid_x", "nodeid_y"], ascending=False, inplace=True
    )
    print(out.head(10))
    

    Prints (running with random.seed(42), with 46mio edges under one minute):

             nodeid_x nodeid_y  count_intersections
    41523794     5587    29315                    4
    40851660    53761    14855                    4
    40587824    53478    42683                    4
    40429793    53307       75                    4
    36763731     4938    42230                    4
    30719061     4290    32316                    4
    6800705     17279    42566                    4
    45616995     9972    39801                    3
    45613051     9968     5797                    3
    45597210     9951    10498                    3