Tags: python, pandas, networkx, bipartite

Bipartite projection and write to CSV with NetworkX -- how to speed up writing to handle large file


I have a fairly large file (3 million lines) in which each line is a person-to-event relationship. Ultimately, I want to project this bipartite network onto a single-mode, weighted network and write it to a CSV file. I'm using NetworkX. I've tested my code on a much smaller sample dataset, and it works as it should. However, when I scale up to my actual dataset, my computer maxes out on memory and spins without making any progress.

I'm using an AWS EC2 machine with 32GB of memory.

After some sample testing, I'm pretty sure things are getting hung up in the final step, after the graph has been projected and while it is being written to a CSV file. I've tried breaking the file up into chunks, but then I run into problems with missing edges or with adding edge weights together correctly. I think a better solution would be to find a way to speed up writing the projected graph to CSV.
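
On the chunking problem: a projected weight is the number of events two people share, so per-chunk weights can simply be summed, provided no event's rows are split across two chunks. A minimal sketch of that merging step, assuming rows are (event, name) tuples; `chunk_weights`, `chunk1`, and `chunk2` are hypothetical names, not part of the code below:

```python
from collections import Counter
from itertools import combinations, groupby
from operator import itemgetter


def chunk_weights(rows):
    """Count co-attendance pairs for one chunk of (event, name) rows."""
    rows = sorted(set(rows))  # deduplicate, and order rows so groupby sees each event once
    weights = Counter()
    for _event, group in groupby(rows, key=itemgetter(0)):
        names = [row[1] for row in group]  # unique and sorted within the event
        weights.update(combinations(names, 2))  # one count per unordered pair
    return weights


# Assumption: the chunks are split on event boundaries.
chunk1 = [("e1", "ann"), ("e1", "bob"), ("e1", "cat")]
chunk2 = [("e2", "bob"), ("e2", "ann")]

total = Counter()
for chunk in (chunk1, chunk2):
    total.update(chunk_weights(chunk))  # Counter.update adds counts together
```

This only illustrates how the weights add up. A Counter holding tens of millions of pairs would still need a lot of memory.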

More information about the original data: Some events have only 1 person attending them, while other events have 5,000 people attending them. Because of this, there will be a huge number of edges (I predict ~50M) created when the bipartite network is folded onto a single-mode network.
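
For a rough sense of scale: an event with k attendees contributes k(k-1)/2 person pairs, so a single 5,000-person event alone accounts for about 12.5 million pairs. A minimal sketch of that upper bound, assuming a hypothetical list `event_sizes` that holds the number of distinct attendees per event (it is an upper bound because a pair sharing several events becomes a single weighted edge):

```python
def max_projected_edges(event_sizes):
    """Upper bound on projected edges: sum of k*(k-1)/2 over all events."""
    return sum(k * (k - 1) // 2 for k in event_sizes)


# Hypothetical sizes: one solo event, one pair, one 5,000-person event.
bound = max_projected_edges([1, 2, 5000])
print(bound)  # 0 + 1 + 12497500 = 12497501
```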

Code using NetworkX to Project Bipartite Network and Write to CSV

# import modules
import datetime
import csv
import networkx as nx
from networkx.algorithms import bipartite

startTime = datetime.datetime.now()

# rename files
infile = 'bipartite_network.csv'
name_outfile = infile.replace('.csv', '_nameFolded.csv')
print 'Files renamed at: ' + str(datetime.datetime.now() - startTime)

# load CSV into a dict
with open(infile, 'rb') as csv_file:
    rawData = list(csv.DictReader(csv_file))
print 'Files loaded at: ' + str(datetime.datetime.now() - startTime)

# create edgelist for Name -x- Event relationships
edgelist = []
for i in rawData:
    edgelist.append((i['Event'], i['Name']))
print 'Bipartite edgelist created at: ' + str(datetime.datetime.now() - startTime)

# deduplicate edgelist
edgelist = sorted(set(edgelist))
print 'Bipartite edgelist deduplicated at: ' + str(datetime.datetime.now() - startTime)

# create a unique list of Name and Event for nodes
Event = sorted(set([i['Event'] for i in rawData]))
Name = sorted(set([i['Name'] for i in rawData]))
print 'Node entities deduplicated at: ' + str(datetime.datetime.now() - startTime)

# add nodes and edges to a graph
B = nx.Graph()
B.add_nodes_from(Event, bipartite=0)
B.add_nodes_from(Name, bipartite=1)
B.add_edges_from(edgelist)
print 'Bipartite graph created at: ' + str(datetime.datetime.now() - startTime)

# create bipartite projection graph
# take the node sets from the 'bipartite' attribute assigned above
event_nodes = set(n for n,d in B.nodes(data=True) if d['bipartite']==0)
name_nodes = set(B) - event_nodes
name_graph = bipartite.weighted_projected_graph(B, name_nodes)
print 'Single-mode projected graph created at: ' + str(datetime.datetime.now() - startTime)

# write graph to CSV
nx.write_weighted_edgelist(name_graph, name_outfile, delimiter=',')
print 'Single-mode weighted edgelist to CSV: ' + str(datetime.datetime.now() - startTime)

endTime = datetime.datetime.now()
print 'Run time: ' + str(endTime - startTime)

Using Pandas to Write the Projected Edgelist, but Missing Edge Weight?

I've thought about using pandas to write name_graph to CSV. Would this be a good option for speeding up the CSV-writing part of the process?

import pandas as pd
df = pd.DataFrame(name_graph.edges(data=True))
df.to_csv('foldedNetwork.csv')

Solution

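  • Here is what I suggested on the networkx-discuss mailing list. It computes each projected edge and its weight directly from the bipartite graph and prints it, so the projected graph never has to be built in memory: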

    import networkx as nx
    
    B = nx.Graph()
    B.add_edge('a',1)
    B.add_edge('a',2)
    B.add_edge('b',1)
    B.add_edge('b',2)
    B.add_edge('b',3)
    B.add_edge('c',3)
    
    nodes = ['a','b','c']
    seen = set()
    for u in nodes:
        # seen = set([u])  # use this instead to print both u-v and v-u
        seen.add(u)  # don't print v-u
        unbrs = set(B[u])
        nbrs2 = set((n for nbr in unbrs for n in B[nbr])) - seen
        for v in nbrs2:
            vnbrs = set(B[v])
            common = unbrs & vnbrs
            weight = len(common)
            print("%s, %s, %d"%(u,v,weight))
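
The same loop can write rows straight to a CSV file instead of printing them. A minimal sketch, assuming a hypothetical helper `write_projection` and an adjacency mapping `adj` in which `adj[u]` iterates over the neighbours of `u` (a NetworkX graph such as `B` supports that lookup as well; a plain dict is used here only to keep the example self-contained):

```python
import csv
import io


def write_projection(adj, nodes, out):
    """Write one 'u,v,weight' row per projected edge; return the row count."""
    writer = csv.writer(out)
    seen = set()
    rows = 0
    for u in nodes:
        seen.add(u)  # skip v-u once u-v has been written
        unbrs = set(adj[u])
        # nodes two steps away from u, minus those already handled
        nbrs2 = set(n for nbr in unbrs for n in adj[nbr]) - seen
        for v in nbrs2:
            weight = len(unbrs & set(adj[v]))  # number of shared neighbours
            writer.writerow([u, v, weight])
            rows += 1
    return rows


# The toy graph from above, written as a plain dict of neighbour sets.
adj = {
    'a': {1, 2}, 'b': {1, 2, 3}, 'c': {3},
    1: {'a', 'b'}, 2: {'a', 'b'}, 3: {'b', 'c'},
}
buf = io.StringIO()
n_rows = write_projection(adj, ['a', 'b', 'c'], buf)
lines = buf.getvalue().splitlines()
```

For real data, `out` would be a file opened for writing (in Python 3, with `newline=''`), and `nodes` would be the set of name nodes. Only one node's neighbourhood is held in memory at a time, plus the `seen` set.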