python, pandas, numpy, networkx, graph-theory

Importing non-square adjacency matrix into Networkx python


I have some data in pandas dataframe form below, where the columns represent discrete skills and the rows represent discrete jobs. A 1 is present only if the skill is required by the job, otherwise 0.

           skill_1  skill_2
    job_1        1        0
    job_2        0        0
    job_3        1        1

I want to create a graph to visualize this relationship between jobs and skills, using networkx. I've tried two methods: nx.from_pandas_adjacency on the dataframe itself, and nx.from_numpy_matrix on a numpy representation of the dataframe, with the column and row names removed.

In both cases, an error was raised because this is a non-square matrix. This makes sense, as networkx is likely interpreting the columns and rows as the same set of nodes. However, the columns and rows represent distinctly different things here. Two jobs are connected by the skill(s) they share and two skills are connected by the job(s) they share, but there is no direct edge between any two skills or any two jobs.

How can I import my data into networkx given that my rows and columns are different sets of nodes?


Solution

  • One option is to generate the missing rows and columns, so that the matrix becomes square with every job and every skill appearing on both axes.

    (I was curious about a vectorised way to achieve this, so I asked a separate question, whose answers provide such a method.)

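    import pandas as pd
    import networkx as nx

    df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                       'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})

    # remember the original (skill) columns before adding the job columns
    skills = list(df.columns)

    # add an all-zero column for every job
    for job in df.index:
        df[job] = 0

    # add an all-zero row for every skill
    # (DataFrame.append was removed in pandas 2.0, so assign via .loc instead)
    for skill in skills:
        df.loc[skill] = 0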
    

    Which gives us:

    >>> df
             skill_1  skill_2  job_1  job_2  job_3
    job_1          1        0      0      0      0
    job_2          0        0      0      0      0
    job_3          1        1      0      0      0
    skill_1        0        0      0      0      0
    skill_2        0        0      0      0      0
    

    We can then read this into networkx using nx.from_pandas_adjacency (assuming you want a directed graph):

    G = nx.from_pandas_adjacency(df, create_using=nx.DiGraph)
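
    The same padding can also be done without explicit loops. The following is a minimal sketch, assuming that DataFrame.reindex with fill_value=0 is an acceptable replacement for the loops above; the names nodes and square are illustrative and do not come from the original answer.

```python
import networkx as nx
import pandas as pd

df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                   'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})

# Every job and every skill becomes a node label
# ("nodes" is an illustrative name, not from the original answer).
nodes = df.index.append(df.columns)

# Pad to a square matrix: labels missing from the rows or the columns
# are created and filled with 0.
square = df.reindex(index=nodes, columns=nodes, fill_value=0)

G = nx.from_pandas_adjacency(square, create_using=nx.DiGraph)

# Each non-zero cell becomes a directed edge from a job to a skill.
print(sorted(G.edges()))
```

    With this data the graph has five nodes, and each edge runs from a job to a skill it requires.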
    

    Alternatively, we can use df.stack() and add the nodes and edges ourselves:

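    import pandas as pd
    import networkx as nx

    df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                       'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})

    G = nx.DiGraph()

    # stack() yields one entry per (job, skill) pair, with the cell as its value
    for (job, skill), required in df.stack().items():
        # add both nodes even when there is no edge, so isolated jobs are kept
        G.add_node(job)
        G.add_node(skill)
        if required:
            G.add_edge(job, skill)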
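
    Since the question describes jobs and skills as two distinct node sets with no edges inside either set, an undirected bipartite graph may be a closer fit than a directed one. The following is a rough sketch under that assumption: it tags the two node sets with networkx's conventional bipartite node attribute and then projects onto the jobs, so that two jobs are linked when they share at least one skill. The names B and job_graph are illustrative.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import bipartite

df = pd.DataFrame({'skill_1': {'job_1': 1, 'job_2': 0, 'job_3': 1},
                   'skill_2': {'job_1': 0, 'job_2': 0, 'job_3': 1}})

B = nx.Graph()

# Tag each node set so the two sides can be told apart later.
B.add_nodes_from(df.index, bipartite=0)    # jobs
B.add_nodes_from(df.columns, bipartite=1)  # skills

# Keep only the (job, skill) pairs whose cell is 1 and use them as edges.
stacked = df.stack()
B.add_edges_from(stacked[stacked == 1].index)

# Project onto the jobs: two jobs are joined if they share a skill.
job_graph = bipartite.projected_graph(B, list(df.index))

print(job_graph.has_edge('job_1', 'job_3'))
```

    Here job_1 and job_3 both require skill_1, so they end up connected in the projection, while job_2 stays isolated. Projecting onto list(df.columns) instead would give the corresponding skill-to-skill graph.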