Search code examples
python-3.xpandasscipyadjacency-matrixpdist

Create adjacency matrix by comparing all rows using a subset of string columns


I have a Pandas dataframe with three columns, id (a unique identifier) and then three string columns event_one, event_two and event_three, as follows:

test_df.head()

    id  event_one  event_two  event_three
0   N1  'aaa'      'abc'      'xyz'
1   N2  'bbb'      'abc'      'uvw'
2   N3  'ccc'      'def'      'xyz'
3   N4  'aaa'      'def'      'uvw'
4   N5  'aaa'      'abc'      'zzz'

I would like to create an adjacency matrix (as a dataframe) that, for all pairs of IDs, will test for string equality across each event column, and then compute the fraction of events that match out of the total number of events.

For example, if we compare N1 with N5, the associated entry in the adjacency matrix for that pair will be 2/3 (or 0.66..), since they match on event_one and event_two, but not event_three.

Using the test_df example above, the final adjacency matrix (represented as a Pandas dataframe) would look as follows:

      N1    N2    N3    N4    N5
      ----------------------------
N1|   -     1/3   1/3   1/3   2/3
N2|   1/3   -     0     1/3   1/3
N3|   1/3   0     -     1/3   0
N4|   1/3   1/3   1/3   -     1/3
N5|   2/3   1/3   0     1/3   -

The code I have written so far (see below) uses pdist and a lambda expression but is based on event_one only. How can I extend this to all three event columns and then calculate the final computation, or is there a better way of doing this?

import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Create the example dataframe
events = {'id': ['N1', 'N2', 'N3', 'N4', 'N5'], 
          'event_one': ['aaa', 'bbb', 'ccc', 'aaa', 'aaa'],
          'event_two': ['abc', 'abc', 'def', 'def', 'abc'],
          'event_three': ['xyz', 'uvw', 'xyz', 'uvw', 'zzz']}
df = pd.DataFrame(events, columns=['id', 'event_one', 'event_two', 'event_three'])

# Create an adjacency matrix by comparing event_one using pdist

index_ndarray = df['id'].values
event_one_series = df['event_one']
dm = pdist(event_one_series[:, None], lambda u, v: 1 if u == v else 0)
event_one_matrix_df = pd.DataFrame(
    squareform(dm),
    dtype='str',
    index=index_ndarray,
    columns=index_ndarray)
event_one_matrix_df

    N1      N2      N3      N4      N5
N1  0.0     0.0     0.0     1.0     1.0
N2  0.0     0.0     0.0     0.0     0.0
N3  0.0     0.0     0.0     0.0     0.0
N4  1.0     0.0     0.0     0.0     1.0
N5  1.0     0.0     0.0     1.0     0.0

Any suggestions would be great, thank you.


Solution

  • Let's use filter to select the event like columns, then use pdist and pass a custom lambda function f which compares the pair wise rows and returns the fraction of events matched:

    from scipy.spatial.distance import pdist, squareform
    
    f = lambda u, v: (u == v).sum() / len(u)
    a = squareform(pdist(df.filter(like='event'), f))
    d = pd.DataFrame(a, index=df['id'].tolist(), columns=df['id'].tolist())
    

    print(d)
             N1        N2        N3        N4        N5
    N1  0.000000  0.333333  0.333333  0.333333  0.666667
    N2  0.333333  0.000000  0.000000  0.333333  0.333333
    N3  0.333333  0.000000  0.000000  0.333333  0.000000
    N4  0.333333  0.333333  0.333333  0.000000  0.333333
    N5  0.666667  0.333333  0.000000  0.333333  0.000000