
Populating Networkx Graph with info iteratively


I am building a graph that links entities when they are mentioned together, e.g. two places are linked if they are co-mentioned in an article.

I have managed to do so, but I am having trouble iteratively adding new information to an edge while keeping the information that is already there.

My approach (since I haven't found anything related anywhere) is to append the existing information to a list, append the new link to that list, and assign the list back to the appropriate edge attribute.

    temp = []
    if G.has_edge(i[z],i[j]):
        temp.append(G[i[z]][i[j]]['article'])
        temp.append(url[index])
        G[i[z]][i[j]]['article'] = temp
    else:
        print "Create edge!"
        G.add_edge(i[z],i[j], article=url)
    del temp[:]

As you can see above, since there are many links to populate, I defined a dedicated list (temp) and loaded into it the old contents of the edge attribute called article. If the edge does not exist, I create it and add as its first value the url that "brought" the 2 places together.

My problem is that, although I empty the list each time so that it is empty when a new pair comes in, when I inspect an edge's urls I get something like this:

{'article': [[...], u'http://www.huffingtonpost.co.uk/.../']

It seems like I am keeping only the last link, since I delete the temporary list's contents each time, but I cannot find a better way to do this without declaring an unnecessary bunch of temp lists.

Any ideas?

Thank you for your time.


Solution

  • TL/DR summary: change your entire snippet to

    if G.has_edge(i[z],i[j]):
        G[i[z]][i[j]]['article'].append(url[index])
    else:
        G.add_edge(i[z],i[j], article=[url])
    

    Here's what's going on:

    When you create the edge the first time you use

    G.add_edge(i[z],i[j], article=url)
    

    So it's a string. But later when you do

    G[i[z]][i[j]]['article'] = temp
    

    you've defined temp to be a list whose first element is G[i[z]][i[j]]['article']. So G[i[z]][i[j]]['article'] is now a list with two elements, the first of which is the old value for G[i[z]][i[j]]['article'] (a string) and the second of which is the new url (also a string).

    Your problem comes at the later steps:

    From then on, the same thing happens again, except that the old value is now a list. G[i[z]][i[j]]['article'] once more becomes a list with two elements, the first of which is its old value (a list) and the second of which is the new url (a string). So you've got a nested list.

    Let's trace through with three urls, 'a', 'b', and 'c', using E to abbreviate G[i[z]][i[j]]['article']. The first time through, you get E='a'. The second time, you get E=['a', 'b']. The third time gives E=[['a','b'],'c']. So each pass after the first makes E[0] the former value of E and E[1] the new url.
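
The trace above can be reproduced with plain Python, without networkx. This is only a minimal sketch, written for Python 3: the dict `edge` is a hypothetical stand-in for the attribute dict G[i[z]][i[j]], `history` is a made-up name used to record each step, and 'a', 'b', 'c' are the placeholder urls from the trace.

```python
# `edge` is a hypothetical stand-in for the attribute dict G[i[z]][i[j]].
edge = {'article': 'a'}              # first time: the edge is created with a string

history = [edge['article']]          # record the value after each step
for new_url in ['b', 'c']:
    temp = []                        # fresh list on every pass
    temp.append(edge['article'])     # the old value goes in as ONE element
    temp.append(new_url)             # the new url is the second element
    edge['article'] = temp
    history.append(edge['article'])

for value in history:
    print(value)
```

The third printed value is the nested list, because the whole previous list was appended as a single element.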

    Two choices:

    1) You can handle the creation of temp differently depending on whether you've got a string or a list. This is the bad choice.

    2) Better: make it a list the whole time and then don't deal with temp at all. Try creating the edge as (..., article = [url]) and then just use G[i[z]][i[j]]['article'].append(url[index]) instead of defining temp.

    So your code would be

    if G.has_edge(i[z],i[j]):
        G[i[z]][i[j]]['article'].append(url[index])
    else:
        G.add_edge(i[z],i[j], article=[url])
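
For a version that runs on its own, here is a rough sketch of the same idea in Python 3. It assumes each new url arrives as a single string; the helper name `add_article`, the place names, and the example.com urls are all invented for illustration and are not from the question.

```python
import networkx as nx


def add_article(G, u, v, new_url):
    """Record that new_url co-mentions nodes u and v (hypothetical helper)."""
    if G.has_edge(u, v):
        # The attribute is already a list, so extend it in place.
        G[u][v]['article'].append(new_url)
    else:
        # Start the attribute as a one-element list, not a bare string.
        G.add_edge(u, v, article=[new_url])


G = nx.Graph()

# Placeholder data: (place, place, url) triples made up for this sketch.
mentions = [
    ('London', 'Paris', 'http://example.com/1'),
    ('London', 'Paris', 'http://example.com/2'),
    ('Paris', 'Rome', 'http://example.com/3'),
    ('Paris', 'London', 'http://example.com/4'),  # same edge in an undirected graph
]
for u, v, link in mentions:
    add_article(G, u, v, link)

print(G['London']['Paris']['article'])
print(G['Paris']['Rome']['article'])
```

Because the attribute is a flat list from the moment the edge is created, every later url is appended at the same level and no temporary list is needed.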
    

    A separate thing that could also cause you problems is the call

    del temp[:]
    

    This should cause behavior different from what you're describing, so I suspect your actual code differs a bit from the snippet. When you set G[i[z]][i[j]]['article'] = temp, you haven't made a copy: the two names now refer to one and the same list. So when you then do del temp[:], you're also emptying G[i[z]][i[j]]['article']. Consider the following:

    temp = []
    temp.append(1)
    print temp
    > [1]    
    L = temp
    print L
    > [1]
    del temp[:]
    print L
    > []
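
If a temporary list really is needed, storing a copy rather than the list itself avoids this aliasing. A small Python 3 sketch; the names `alias` and `snapshot` are made up for illustration.

```python
temp = [1]
alias = temp             # a second name for the SAME list object
snapshot = list(temp)    # a shallow copy: a new, independent list

del temp[:]              # empties the shared object in place

print(alias)             # the alias sees the deletion
print(snapshot)          # the copy still holds the old contents
```

Applied to the question's snippet, that would mean assigning list(temp) to the edge attribute instead of temp, although appending directly to the attribute as suggested above is simpler.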