
Optimising memory utilization in python networkx


I am analysing a network of blogs by building a tag network: there is an edge between any two blogs that share common tags, with weight = (number of shared tags) / (number of tags in either blog). The graph has around 10,000 nodes. I need to convert the raw data into GraphML format, and for that I am using Python networkx, but it is running out of memory. I am new to Python, so can anyone please tell me what I am doing wrong here? (Or is it a hardware problem? My system is an i3 with 3 GB of memory.)

#!/usr/bin/env python
import sys
import networkx as nx

G = nx.Graph()
tags = []
for line in open(sys.argv[1]):  # each blog has all of its tags on a single line
    tags.append(set(line.split(',')))  # tags are separated by commas
for i in xrange(len(tags)):
    G.add_node(i + 1)
for i in xrange(len(tags)):
    for j in xrange(i + 1, len(tags)):
        p = len(tags[i].intersection(tags[j]))
        q = len(tags[i].union(tags[j]))
        if p != 0 and q != 0:
            G.add_edge(i + 1, j + 1, weight=float(p) / q)
nx.write_graphml(G, sys.argv[1] + '.graphml')

Solution

  • The only improvement I can see is that instead of keeping a 2-D list of tag sets, I can use a binary flag bit for each tag. That lowers the memory requirement, since tags can be pretty long and there are only ~150 distinct tags, so there is a lot of repetition. But this doesn't change much: the real problem is in the write_graphml function, as Aric mentioned in the comments. I was finally able to run it on a 16 GB machine, where it took ~9.5 GB.
    PS: If anyone knows a better technique, please tell me.
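
    A minimal sketch of the "binary flag bit" idea above: with only ~150 distinct tags, each blog's tag set fits in a single Python integer used as a bitmask, and the edge weight (shared tags / tags in either) reduces to popcounts of bitwise AND and OR. The helper names (build_tag_index, to_mask, jaccard) are illustrative, not from the original post.

    ```python
    def build_tag_index(tag_sets):
        """Assign each distinct tag a bit position."""
        index = {}
        for tags in tag_sets:
            for t in tags:
                if t not in index:
                    index[t] = len(index)
        return index

    def to_mask(tags, index):
        """Pack one blog's tag set into a single integer bitmask."""
        mask = 0
        for t in tags:
            mask |= 1 << index[t]
        return mask

    def jaccard(mask_a, mask_b):
        """Weight = |intersection| / |union|, computed via bitwise AND/OR."""
        inter = bin(mask_a & mask_b).count('1')
        union = bin(mask_a | mask_b).count('1')
        return float(inter) / union if union else 0.0

    # Example: blogs 0 and 1 share one tag out of three total.
    blogs = [{'python', 'graphs'}, {'python', 'memory'}, {'ruby'}]
    idx = build_tag_index(blogs)
    masks = [to_mask(b, idx) for b in blogs]
    print(jaccard(masks[0], masks[1]))  # 1 shared / 3 total -> 0.333...
    ```

    This replaces the per-pair set intersection/union with two bitwise operations, so only one small integer is kept per blog instead of a set of strings; it does not, however, reduce the memory used by write_graphml itself.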