Tags: python, networkx, graph-theory, gephi

Python: how to convert elements of a list of lists into an undirected graph?


I have a program that retrieves a list of PubMed publications, and I wish to build a co-authorship graph: for each article I want to add each author (if not already present) as a vertex and add an undirected edge (or increase its weight) between every pair of coauthors.

I managed to write the first part of the program, which retrieves the list of authors for each publication, and I understand I could use the NetworkX library to build the graph (and then export it to GraphML for Gephi), but I cannot wrap my head around how to transform the "list of lists" into a graph.

Here follows my code. Thank you very much.

### if needed install the required modules
### python3 -m pip install biopython
### python3 -m pip install numpy

from Bio import Entrez
from Bio import Medline
Entrez.email = "rja@it.com"
handle = Entrez.esearch(db="pubmed", term='("lung diseases, interstitial"[MeSH Terms] NOT "pneumoconiosis"[MeSH Terms]) AND "artificial intelligence"[MeSH Terms] AND "humans"[MeSH Terms]', retmax="1000", sort="relevance", retmode="xml")
records = Entrez.read(handle)
ids = records['IdList']
h = Entrez.efetch(db='pubmed', id=ids, rettype='medline', retmode='text')
#now h holds all of the articles and their sections
records = Medline.parse(h)
# initialize an empty vector for the authors
authors = []
# iterate through all articles
for record in records:
    # for each article (record) get the authors list;
    # default to an empty list when the AU field is missing
    au = record.get('AU', [])
    # now from the author list iterate through each author
    for a in au:
        if a not in authors:
            authors.append(a)
# the following just shows the alphabetically sorted list of all
# unique authors (these should become my graph nodes)
authors.sort()
print('Authors: {0}'.format(', '.join(authors)))

Solution

  • Good, the code runs, so the data structures are clear. As an approach, we build the connectivity matrices for both articles/authors and authors/co-authors.

    List of authors: to describe the relation between the articles and the authors, I think you need the author list of each article, so keep it while iterating over the records.

    authors = []
    author_lists = []              # <--- new
    for record in records:
        au = record.get('AU', [])  # empty list if the article has no AU field
        author_lists.append(au)    # <--- new
        for a in au:
            if a not in authors:
                authors.append(a)
    authors.sort()
    print(authors)
    

    numpy, pandas and matplotlib are simply the tools I am used to working with.

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    
    AU = np.array(authors)        # authors as np-array
    NA = AU.shape[0]              # number of authors
    
    NL = len(author_lists)        # number of articles/author lists
    AUL = np.array(author_lists, dtype=object)  # author lists as np-array (lists differ in length, hence dtype=object)
    
    print('NA, NL', NA,NL)
    

    Connectivity articles/authors

    CON = np.zeros((NL,NA),dtype=int) # initializes connectivity matrix
    for j in range(NL):               # run through the articles' author lists
        aul = np.array(AUL[j])        # get a single author list as np-array
        z = np.zeros((NA),dtype=int)
        for k in range(len(aul)):     # get a single author
            z += (AU==aul[k])         # find its position in AU and add it up
        CON[j,:] = z                  # insert the result in the connectivity matrix
    
    #---- graphics --------
    fig = plt.figure(figsize=(20,10))
    plt.spy(CON, marker ='s', color='chartreuse', markersize=5)
    plt.xlabel('Authors'); plt.ylabel('Articles'); plt.title('Authors of the articles', fontweight='bold')
    plt.show()
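The same article/author matrix can also be filled through a dictionary lookup instead of comparing each name against the whole author array. The following is a minimal sketch, not the code used above: the sample author_lists is made-up data standing in for the PubMed result, and index is a helper name introduced here. Unlike the loop above, it records presence (0 or 1) even if a name were listed twice in one article.

```python
import numpy as np

# Hypothetical sample data standing in for the PubMed author lists
author_lists = [["Smith J", "Lee K"],
                ["Lee K", "Garcia M", "Smith J"],
                ["Chen W"]]

# Unique authors, sorted, and a lookup from author name to column index
authors = sorted({name for au in author_lists for name in au})
index = {name: col for col, name in enumerate(authors)}

# One row per article, one column per author
CON = np.zeros((len(author_lists), len(authors)), dtype=int)
for row, au in enumerate(author_lists):
    for name in set(au):              # set() ignores a name listed twice
        CON[row, index[name]] = 1

print(authors)
print(CON)
```

The dictionary lookup avoids scanning all authors for every name, which may matter once the search returns many articles.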
    


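    Connectivity authors/co-authors: the resulting matrix is symmetric. Entry (i, j) counts the articles that authors i and j share, and the diagonal holds each author's own article count.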

    df = pd.DataFrame(CON)          # let's use pandas for the following step
    ACON = np.zeros((NA,NA))         # initialize the connectivity matrix
    for j in range(NA):              # run through the authors
        df_a = df[df.iloc[:, j] >0]  # select all articles (rows) involving author j
        w = np.array(df_a.sum())     # sum each author column over these rows
        ACON[j] = w                  # insert it in the connectivity matrix
    
    #---- graphics --------
    fig = plt.figure(figsize=(10,10))
    plt.spy(ACON, marker ='s', color='chartreuse', markersize=3)
    plt.xlabel('Authors'); plt.ylabel('Authors'); plt.title('Authors that are co-authors', fontweight='bold')
    plt.show()
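If CON holds only 0/1 entries, the pandas loop above should give the same result as the matrix product CON.T @ CON: entry (i, j) counts the articles in which authors i and j both appear. A small sketch with a made-up CON (three articles, four authors), offered as an illustration under that assumption rather than as a replacement:

```python
import numpy as np

# Hypothetical 0/1 article/author matrix: rows are articles, columns are authors
CON = np.array([[0, 0, 1, 1],
                [0, 1, 1, 1],
                [1, 0, 0, 0]])

# Author/co-author counts: number of shared articles for every author pair
ACON = CON.T @ CON

print(ACON)
print("symmetric:", bool((ACON == ACON.T).all()))
```

The diagonal of ACON is the number of articles per author, so it is usually set aside before the matrix is read as a list of edges.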
    


    For the graphics with NetworkX, I think you need a clear idea of what you want to represent, because there are many points and many possibilities too (perhaps you could post an example?). Only a few author circles are plotted below.

    import networkx as nx
    
    def set_edges(Q):
        # pair each index with its neighbour in the rolled array,
        # which closes the indices in Q into a ring ("author circle")
        Q1 = np.roll(Q,shift=1)
        Edges = np.vstack((Q,Q1)).T
        return Edges
    
    Q = nx.Graph()
    
    AT = np.triu(ACON)                        # only the upper triangle is needed
    fig = plt.figure(figsize=(7,7))
    for k in range(min(9, NA)):               # only the first few authors
        iA = np.argwhere(AT[k]>0).ravel()     # get the indices with AT[k]>0
        Edges = set_edges(iA)                 # select the involved nodes and set the edges
        Q.add_edges_from(Edges.tolist())      # plain ints as node ids
    nx.draw(Q, with_labels=True, alpha=0.5)   # with_labels is a drawing option
    plt.title('Co-author-ship', fontweight='bold')
    plt.show()
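To get the weighted co-authorship graph asked for in the question directly from the list of lists, one possible approach (a sketch, not the method used above) is to add an edge for every pair of authors of an article with itertools.combinations, and to increase the edge's weight attribute whenever the pair shows up again. The function name build_coauthor_graph, the sample author lists, and the output file name coauthors.graphml are assumptions made for illustration.

```python
from itertools import combinations

import networkx as nx


def build_coauthor_graph(author_lists):
    """Build an undirected graph whose edge weights count shared articles."""
    G = nx.Graph()
    for au in author_lists:
        names = sorted(set(au))           # drop duplicates within one article
        G.add_nodes_from(names)           # keeps single-author articles as isolated nodes
        for a, b in combinations(names, 2):
            if G.has_edge(a, b):
                G[a][b]["weight"] += 1    # pair seen before: increase the weight
            else:
                G.add_edge(a, b, weight=1)
    return G


# Hypothetical sample data standing in for the PubMed author lists
author_lists = [["Smith J", "Lee K"],
                ["Lee K", "Garcia M", "Smith J"],
                ["Chen W"]]

G = build_coauthor_graph(author_lists)
print(G.number_of_nodes(), "authors,", G.number_of_edges(), "co-author pairs")
print("Lee K / Smith J weight:", G["Lee K"]["Smith J"]["weight"])

# Export for Gephi (the file name is an assumption)
nx.write_graphml(G, "coauthors.graphml")
```

The exported file stores weight as an edge attribute, so it should be available in Gephi for scaling edge thickness.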
    

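Alternatively, if the matrix ACON from above is already available, NetworkX can turn it into a weighted graph with nx.from_numpy_array. A minimal sketch under assumptions: the small ACON and authors below are made-up stand-ins for the real data, and the diagonal is zeroed first because it counts each author's own articles and would otherwise become self-loops.

```python
import numpy as np
import networkx as nx

# Hypothetical stand-ins for the sorted author list and the co-author matrix
authors = ["Chen W", "Garcia M", "Lee K", "Smith J"]
ACON = np.array([[1, 0, 0, 0],
                 [0, 1, 1, 1],
                 [0, 1, 2, 2],
                 [0, 1, 2, 2]])

A = ACON.copy()
np.fill_diagonal(A, 0)                    # remove self-loops (own article counts)

G = nx.from_numpy_array(A)                # non-zero entries become weighted edges
G = nx.relabel_nodes(G, dict(enumerate(authors)))  # indices -> author names

for a, b, w in G.edges(data="weight"):
    print(a, "--", b, "weight", w)
```

Authors without any co-author stay in the graph as isolated nodes, since from_numpy_array creates one node per matrix row.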