I have a dataframe of users and their friends that looks like:
user_id | friend_id
1 3
1 4
2 3
2 5
3 4
I want to write a function in python
to compute the number of mutual friends for each pair:
user_id | friend_id | num_mutual
1 3 1
1 4 1
2 3 0
2 5 0
3 4 1
Currently I have:
def find_mutual(df):
num_mutual = []
for i in range(len(df)):
user, friend = df.loc[i, 'user_id'], df.loc[i, 'friend_id']
user_list = df[df.user_id == user].friend_id.tolist() + df[df.friend_id == user].user_id.tolist()
friend_list = df[df.user_id == friend].friend_id.tolist() + df[df.friend_id == friend].user_id.tolist()
mutual = len(list(set(user_list) & set(friend_list)))
num_mutual.append(mutual)
return num_mutual
It works fine for small datasets, but I'm running it on a dataset with millions of rows. It takes forever to run everything. I know it's not the ideal way to find the count. Is there a better algorithm in Python? Thanks in advance!
The [ugly] idea is to construct a 4 point path that starts with a user_id
and ends with the same user_id
. If such a path exists, then 2 starting points have mutual friends.
We start with:
df
user_id friend_id
0 1 3
1 1 4
2 2 3
3 2 5
4 3 4
Then you can do:
dff = df.append(df.rename(columns={"user_id":"friend_id","friend_id":"user_id"}))
df_new = dff.merge(dff, on="friend_id", how="outer")
df_new = df_new[df_new["user_id_x"]!= df_new["user_id_y"]]
df_new = df_new.merge(dff, left_on= "user_id_y", right_on="user_id")
df_new = df_new[df_new["user_id_x"]==df_new["friend_id_y"]]
df_out = df.merge(df_new, left_on=["user_id","friend_id"], right_on=["user_id_x","friend_id_x"], how="left",suffixes=("__","_"))
df_out["count"] = (~df_out["user_id_x"].isnull()).astype(int)
df_out[["user_id__","friend_id","count"]]
user_id__ friend_id count
0 1 3 1
1 1 4 1
2 2 3 0
3 2 5 0
4 3 4 1
A more elegant and straightforward way to use a graph approach
import networkx as nx
g = nx.from_pandas_edgelist(df, "user_id","friend_id")
nx.draw_networkx(g)
Then you can identify number of mutual friends as number of paths for 2 adjacent nodes (2 friends) for which a 3 node path exists:
from networkx.algorithms.simple_paths import all_simple_paths
for row in df.itertuples():
df.at[row[0],"count"] = sum([len(l)==3 for l in list(all_simple_paths(g, row[1], row[2]))])
print(df)
user_id friend_id count
0 1 3 1.0
1 1 4 1.0
2 2 3 0.0
3 2 5 0.0
4 3 4 1.0