Search code examples
pythonindexingbipartite

Optimizing an indexing script for a bipartite graph


I'm analyzing network and semantic values of tweets I downloaded on a given subject and geolocation, using bipartite graph.

Using Python, I create a .net file containing the 2 sets of nodes and the edges. This file is a merge of files I created separately: 2 sets of Vertices and the Edges. Problem is creating the Edges component of the .net file.

I have 3 files:

  • tweeterers.csv with the senders/tweeterers ("number/ID" and "name")
  • words.csv, with semantic tags/words I extracted from the tweets. Format is "number/ID" and "name", with "number" starting from the last "number" of the above file. There are 0 to 6 words per row
  • Names_Text_full_clean.csv, with tweeterers and words. Each row contains 1 tweeterer's name and 0 to 6 words This file will give me the association between tweeterers and words, for the graph.

I basically read each tweeterer, read a word, read if there is an association. If yes, I write the association (that is an edge). It's triple loop. This is painfully slow with medium size networks: a network with ~650 nodes and ~18000 edges took me almost 2 days on a Mac Mini 2.7GHz quadcore.

Any help to speed it up will be highly appreciated!

The following is the code:

import csv # csv library is to handle csv files

# open the twetterers file and make it available in 'reader1'
file_read1 = open('tweeterers.csv', 'rU')
    reader1 = csv.reader(file_read1)

# open the file for writing and make it available in 'writer'
file_write=open('edges.csv', 'wb')
writer=csv.writer(file_write)


for sender in reader1:
    file_read2 = open('words.csv', 'rU')
    reader2 = csv.reader(file_read2)
    for word in reader2:
        file_read = open('Names_Text_full_clean.csv', 'rU')
        reader = csv.reader(file_read)
        for match in reader:
            for elem in range (1,len(match)):
                if sender[1] == match [0]:
                    if word [1] == match [elem]:
                        a = sender[0],word[0]
                        writer.writerow(a)
                        print "I wrote a record: it's: ",a

file_read.close()
file_read1.close()
file_read2.close()
file_write.close()

Solution

  • Use dictionaries. For example, a first step would be to read Names_Text_full_clean.csv only once and store the result in a dictionary, indexed by match[0]. Because there might be several times the same match[0] you need to store as a value the list of the possibly multiple match objects.

    import collections
    by_sender = collections.defaultdict(list)
    file_read = open('Names_Text_full_clean.csv', 'rU')
    reader = csv.reader(file_read)
    for match in reader:
        by_sender[match[0]].append(match)
    

    Then in the nested loops, you can replace

        for match in reader:
            if sender[1] == match [0]:
    

    with the following loop, which is probably hundreds of times smaller:

        for match in by_sender[sender[1]]:
    

    A further optimization would be to not store match in the list by_sender[match[0]], but to store set(match[1:]). Indeed, you're only going to look if a particular entry (word[1] in this case) is equal to any one of the items in match[1:]. Instead of looping to figure this out, it can be done with just word[1] in my_set.

    This is probably enough, but the "final goal" would be to read all three files once only. You would store the content of two of the files in some suitable dictionaries, and do dictionary lookups only when walking over the third file (or "set lookups", which are very fast too).