Search code examples
pythonpython-3.xpandasdataframemutual-friendship

Find the number of mutual friends in Python


I have a dataframe of users and their friends that looks like:

user_id | friend_id
1         3
1         4
2         3
2         5
3         4

I want to write a function in python to compute the number of mutual friends for each pair:

user_id | friend_id | num_mutual
1         3           1
1         4           1
2         3           0
2         5           0
3         4           1

Currently I have:

def find_mutual(df):
    num_mutual = []
    for i in range(len(df)):
        user, friend = df.loc[i, 'user_id'], df.loc[i, 'friend_id']
        user_list = df[df.user_id == user].friend_id.tolist() + df[df.friend_id == user].user_id.tolist()
        friend_list = df[df.user_id == friend].friend_id.tolist() + df[df.friend_id == friend].user_id.tolist()
        mutual = len(list(set(user_list) & set(friend_list)))
        num_mutual.append(mutual)
    return num_mutual

It works fine for small datasets, but I'm running it on a dataset with millions of rows. It takes forever to run everything. I know it's not the ideal way to find the count. Is there a better algorithm in Python? Thanks in advance!


Solution

  • The [ugly] idea is to construct a 4 point path that starts with a user_id and ends with the same user_id. If such a path exists, then 2 starting points have mutual friends.

    We start with:

    df
              user_id  friend_id
    0        1          3
    1        1          4
    2        2          3
    3        2          5
    4        3          4
    

    Then you can do:

    dff = df.append(df.rename(columns={"user_id":"friend_id","friend_id":"user_id"}))
    df_new = dff.merge(dff, on="friend_id", how="outer")
    df_new = df_new[df_new["user_id_x"]!= df_new["user_id_y"]]
    df_new = df_new.merge(dff, left_on= "user_id_y", right_on="user_id")
    df_new = df_new[df_new["user_id_x"]==df_new["friend_id_y"]]
    df_out = df.merge(df_new, left_on=["user_id","friend_id"], right_on=["user_id_x","friend_id_x"], how="left",suffixes=("__","_"))
    df_out["count"] = (~df_out["user_id_x"].isnull()).astype(int)
    df_out[["user_id__","friend_id","count"]]
    
       user_id__  friend_id  count
    0          1          3      1
    1          1          4      1
    2          2          3      0
    3          2          5      0
    4          3          4      1
    

    A more elegant and straightforward way to use a graph approach

    import networkx as nx
    g = nx.from_pandas_edgelist(df, "user_id","friend_id")
    nx.draw_networkx(g)
    

    enter image description here

    Then you can identify number of mutual friends as number of paths for 2 adjacent nodes (2 friends) for which a 3 node path exists:

    from networkx.algorithms.simple_paths import all_simple_paths
    for row in df.itertuples():
        df.at[row[0],"count"] = sum([len(l)==3 for l in list(all_simple_paths(g, row[1], row[2]))])
    print(df)
       user_id  friend_id  count
    0        1          3    1.0
    1        1          4    1.0
    2        2          3    0.0
    3        2          5    0.0
    4        3          4    1.0