Search code examples
pythonpandasadjacency-matrix

How can I create an adjacency matrix from a long list of source/target pairs?


Given the following data:

Class       Name
======      =============
Math        John Smith
-------------------------
Math        Jenny Simmons
-------------------------
English     Sarah Blume
-------------------------
English     John Smith
-------------------------
Chemistry   Roger Tisch
-------------------------
Chemistry   Jenny Simmons
-------------------------
Physics     Sarah Blume
-------------------------
Physics     Jenny Simmons

I have a list of classes and names in each, like so:

[
{class: 'Math', student: 'John Smith'},
{class: 'Math', student: 'Jenny Simmons'},
{class: 'English', student: 'Sarah Blume'},
{class: 'English', student: 'John Smith'},
{class: 'Chemistry', student: 'John Smith'},
{class: 'Chemistry', student: 'Jenny Simmons'},
{class: 'Physics', student: 'Sarah Blume'},
{class: 'Physics', student: 'Jenny Simmons'},
]

I'd like to create an adjacency matrix, which would, as input, have the following structure, showing the number of students in common between each pair of classes:

enter image description here

How would I be able to do so in python/pandas in the most performant manner? I've got ~19M of these class/student pairs (~240MB) in my list.


Solution

  • You can prepare the data for the adjacency matrix like this:

    # create the "class-tuples" by
    # joining the dataframe with itself
    df_cross= df.merge(df, on='student', suffixes=['_left', '_right'])
    # remove the duplicate tuples
    # --> this will get you a upper / or lower
    # triangular matrix with diagonal = 0
    # if you rather want to have a full matrix
    # just change the >= to == below
    del_indexer= (df_cross['class_left']>=df_cross['class_right'])
    df_cross.drop(df_cross[del_indexer].index, inplace=True)
    # create the counts / lists
    grouby_obj= df_cross.groupby(['class_left', 'class_right'])
    result= grouby_obj.count()
    result.columns= ['value']
    # if you want to have lists of student names
    # that have the course-combination in
    # common, you can do it with the following line
    # otherwise just remove it (I guess with a 
    # dataset of the size you mentioned, it will
    # consume a lot of memory)
    result['students']= grouby_obj.agg(list)
    

    The full output looks like this:

    Out[133]: 
                            value                     students
    class_left class_right                                    
    Chemistry  English          1                 [John Smith]
               Math             2  [John Smith, Jenny Simmons]
               Physics          1              [Jenny Simmons]
    English    Math             1                 [John Smith]
               Physics          1                [Sarah Blume]
    Math       Physics          1              [Jenny Simmons]
    

    You then can use @piRSquared's method to pivot it, or do it like this:

    result['value'].unstack()
    
    Out[137]: 
    class_right  English  Math  Physics
    class_left                         
    Chemistry        1.0   2.0      1.0
    English          NaN   1.0      1.0
    Math             NaN   NaN      1.0
    

    Or, if you also want the names:

    result.unstack()
    Out[138]: 
                  value                   students                                              
    class_right English Math Physics       English                         Math          Physics
    class_left                                                                                  
    Chemistry       1.0  2.0     1.0  [John Smith]  [John Smith, Jenny Simmons]  [Jenny Simmons]
    English         NaN  1.0     1.0           NaN                 [John Smith]    [Sarah Blume]
    Math            NaN  NaN     1.0           NaN                          NaN  [Jenny Simmons]