I have a Pandas dataframe with four columns: id (a unique identifier) and three string columns, event_one, event_two and event_three, as follows:
test_df.head()
id event_one event_two event_three
0 N1 'aaa' 'abc' 'xyz'
1 N2 'bbb' 'abc' 'uvw'
2 N3 'ccc' 'def' 'xyz'
3 N4 'aaa' 'def' 'uvw'
4 N5 'aaa' 'abc' 'zzz'
I would like to create an adjacency matrix (as a dataframe) that, for all pairs of IDs, will test for string equality across each event column, and then compute the fraction of events that match out of the total number of events.
For example, if we compare N1 with N5, the associated entry in the adjacency matrix for that pair will be 2/3 (or 0.66..), since they match on event_one and event_two, but not event_three.
Using the test_df example above, the final adjacency matrix (represented as a Pandas dataframe) would look as follows:
N1 N2 N3 N4 N5
----------------------------
N1| - 1/3 1/3 1/3 2/3
N2| 1/3 - 0 1/3 1/3
N3| 1/3 0 - 1/3 0
N4| 1/3 1/3 1/3 - 1/3
N5| 2/3 1/3 0 1/3 -
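For reference, the match fraction for a single pair can be computed directly; a minimal sketch using the test_df shown above (the column list is just the three event columns):
cols = ['event_one', 'event_two', 'event_three']
n1 = test_df.loc[test_df['id'] == 'N1', cols].iloc[0]
n5 = test_df.loc[test_df['id'] == 'N5', cols].iloc[0]
print((n1 == n5).mean())   # 0.666..., i.e. 2/3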
The code I have written so far (see below) uses pdist and a lambda expression, but it only covers event_one. How can I extend this to all three event columns and then compute the final fractions, or is there a better way of doing this?
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Create the example dataframe
events = {'id': ['N1', 'N2', 'N3', 'N4', 'N5'],
          'event_one': ['aaa', 'bbb', 'ccc', 'aaa', 'aaa'],
          'event_two': ['abc', 'abc', 'def', 'def', 'abc'],
          'event_three': ['xyz', 'uvw', 'xyz', 'uvw', 'zzz']}
df = pd.DataFrame(events, columns=['id', 'event_one', 'event_two', 'event_three'])

# Create an adjacency matrix by comparing event_one using pdist
index_ndarray = df['id'].values
# Reshape the column to (n_rows, 1) so pdist treats each row as a single "feature";
# the metric receives two length-1 arrays and scores 1 for a match, 0 otherwise
event_one_array = df['event_one'].to_numpy()[:, None]
dm = pdist(event_one_array, lambda u, v: 1.0 if u[0] == v[0] else 0.0)
event_one_matrix_df = pd.DataFrame(
    squareform(dm),
    index=index_ndarray,
    columns=index_ndarray)
event_one_matrix_df
N1 N2 N3 N4 N5
N1 0.0 0.0 0.0 1.0 1.0
N2 0.0 0.0 0.0 0.0 0.0
N3 0.0 0.0 0.0 0.0 0.0
N4 1.0 0.0 0.0 0.0 1.0
N5 1.0 0.0 0.0 1.0 0.0
Any suggestions would be great, thank you.
Let's use filter to select the event-like columns, then use pdist with a custom lambda function f that compares each pair of rows and returns the fraction of events that match:
from scipy.spatial.distance import pdist, squareform

# Fraction of event columns that match between two rows
f = lambda u, v: (u == v).sum() / len(u)
# pdist applies f to every pair of rows; squareform expands it to a square matrix
a = squareform(pdist(df.filter(like='event'), f))
d = pd.DataFrame(a, index=df['id'].tolist(), columns=df['id'].tolist())
print(d)
N1 N2 N3 N4 N5
N1 0.000000 0.333333 0.333333 0.333333 0.666667
N2 0.333333 0.000000 0.000000 0.333333 0.333333
N3 0.333333 0.000000 0.000000 0.333333 0.000000
N4 0.333333 0.333333 0.333333 0.000000 0.333333
N5 0.666667 0.333333 0.000000 0.333333 0.000000
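Note that squareform puts zeros on the diagonal rather than the '-' shown in the desired layout; you can mask those afterwards if needed. If your SciPy version is strict about non-numeric input to pdist, the same matrix can be built with plain NumPy broadcasting instead; a minimal sketch, assuming the same df as above:
import numpy as np

# (n_rows, 3) array of event strings; compare every row against every other row
events_arr = df.filter(like='event').to_numpy()
frac = (events_arr[:, None, :] == events_arr[None, :, :]).mean(axis=2)

# Optional: blank out the diagonal to mirror the '-' in the desired layout
np.fill_diagonal(frac, np.nan)

d_alt = pd.DataFrame(frac, index=df['id'].tolist(), columns=df['id'].tolist())
print(d_alt)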