Search code examples
pythonpandaspython-3.7last.fmknowledge-graph

How can I build a relation between columns in pandas?


I have a data frame from the LASTFM dataset with columns: user_id | friend_id like so:

uid | fid
346 | 23
355 | 48

and I would like to have a relation between users as a third columns (kind of an adjacency vector), such as:

uid1 | uid2 | friends
23   | 48   | 0
23   | 56   | 0
23   | ..   | 0
23   | 346  | 1
48   | 23   | 0
48   | 56   | 0
48   | ..   | 0
48   | 346  | 0
48   | 355  | 1
23   | ..   | 0
23   | 346  | 1
346  | 23   | 1

I have tried working with merge, join, lambda but have so far been unsuccessful. Any help would be appreciated.


Solution

  • The strategy here is two step. First create the UID cross product dataset, then attach the friends indicator:

    Create the UID cross product by first taking the union of pairs from original dataset, along with their inverses. We'll create an intermediate dataset, friends, which we'll use later in the process to indicate which UIDs are friends:

    pairs = df.rename(columns={'uid': 'uid1', 'fid': 'uid2'})
    friends = pd.concat([pairs, pairs.rename(columns={'uid1': 'uid2', 'uid2':'uid1'})])
    uids = friends.uid1.drop_duplicates().to_frame(name='uid')
    
       uid
    0  346
    1  355
    0   23
    1   48
    

    Then, append a dummy merge key to allow a cross product merge:

    uids['dummy_key'] = 1
    uids = uids.merge(uids, on='dummy_key', suffixes=('1', '2'))[['uid1', 'uid2']]
    
        uid1  uid2
    0    346   346
    1    346   355
    2    346    23
    3    346    48
    4    355   346
    5    355   355
    ...
    

    Now, we merge on the friends dataset, attaching an indicator column, to start our adjacency list:

    adj = uids.merge(friends, on=['uid1', 'uid2'], how='left', indicator=True)
    
        uid1  uid2     _merge
    0    346   346  left_only
    1    346   355  left_only
    2    346    23       both
    3    346    48  left_only
    4    355   346  left_only
    5    355   355  left_only
    ...
    

    Finally, we code the _merge indicator into the friend column:

    adj['friends'] = adj.apply(lambda row: 1 if row['_merge'] == 'both' else 0, axis=1)
    adj = adj[['uid1', 'uid2', 'friends']]
    
        uid1  uid2  friends
    0    346   346        0
    1    346   355        0
    2    346    23        1
    3    346    48        0
    4    355   346        0
    5    355   355        0