
Python: how to count collaborations between pairs in pandas dataframe?


I have a dataframe like this

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df 
Item Name  Weight
A    Tom     2
A    John    2
A    Paul    2
B    Tom     3
B    Frank   3
C    Tom     5
C    John    5
C    Richard 5
C    James   5 

For each person, I want the list of the people who share an item with them, where each shared item contributes the reciprocal of its weight (summed when a pair shares several items):

df1 
Name              People                          Times
Tom     [John, Paul, Frank, Richard, James]       [(1/2+1/5),1/2,1/3,1/5,1/5]
John    [Tom, Paul, Richard, James]               [(1/2+1/5),1/2,1/5,1/5]
Paul    [Tom, John]                               [1/2,1/2]
Frank   [Tom]                                     [1/3]
Richard [Tom, John, James]                        [1/5,1/5,1/5]
James   [Tom, John, Richard]                      [1/5,1/5,1/5]

In order to count the number of collaborations without considering the weight, I did:

#merge M:N by column Item
df1 = pd.merge(df, df, on=['Item'])

#remove rows pairing a name with itself (Name_x == Name_y)
df1 = df1[df1['Name_x'] != df1['Name_y']]
#print(df1)

#create lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
    Name_x                                     Name_y
0    Frank                                      [Tom]
1    James                       [Tom, John, Richard]
2     John           [Tom, Paul, Tom, Richard, James]
3     Paul                                [Tom, John]
4  Richard                         [Tom, John, James]
5      Tom  [John, Paul, Frank, John, Richard, James]


#get counts with np.unique (assumes import numpy as np)
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
#remove column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x':'Name'})
print(df1)
      Name                               People            times
0    Frank                                [Tom]              [1]
1    James                 [John, Richard, Tom]        [1, 1, 1]
2     John          [James, Paul, Richard, Tom]     [1, 1, 1, 2]
3     Paul                          [John, Tom]           [1, 1]
4  Richard                   [James, John, Tom]        [1, 1, 1]
5      Tom  [Frank, James, John, Paul, Richard]  [1, 1, 2, 1, 1]
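The `np.unique` step can equivalently be replaced by a single `groupby(...).size()` on the merged pairs, which counts how many items each pair shares directly. A self-contained sketch of that alternative (using the same data as above):

```python
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# self-merge on Item pairs every name with every co-occurring name
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# count how many items each (Name_x, Name_y) pair shares
counts = pairs.groupby(['Name_x', 'Name_y']).size()

# collect collaborators and counts per person
out = pd.DataFrame({
    'People': counts.groupby(level='Name_x').apply(
        lambda s: s.index.get_level_values('Name_y').tolist()),
    'times': counts.groupby(level='Name_x').apply(lambda s: s.tolist()),
})
```

Because `groupby` sorts its keys, the lists come out in the same alphabetical order as the `np.unique` version.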

In the last dataframe I have the count of collaborations between all the pairs; however, I would like the weighted count of collaborations.


Solution

  • Starting with:

    df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                       'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                       'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
    
    df1 = pd.merge(df, df, on=['Item'])
    df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1)
    

    You could use .apply() to create the values and .unstack() for the wide format:

    collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x']
    
    Name_y      Frank  James  John  Paul  Richard       Tom
    Name_x                                                 
    Frank         NaN    NaN   NaN   NaN      NaN  0.333333
    James         NaN    NaN   0.2   NaN      0.2  0.200000
    John          NaN    0.2   NaN   0.5      0.2  0.700000
    Paul          NaN    NaN   0.5   NaN      NaN  0.500000
    Richard       NaN    0.2   0.2   NaN      NaN  0.200000
    Tom      0.333333    0.2   0.7   0.5      0.2       NaN
    

    then iterate over the rows and convert to lists:

df = pd.DataFrame(columns=['People', 'Times'])
for p, data in collab.iterrows():
    s = data.dropna()
    df.loc[p] = [s.index.tolist(), s.tolist()]
    
                                          People  \
    Frank                                  [Tom]   
    James                   [John, Richard, Tom]   
    John             [James, Paul, Richard, Tom]   
    Paul                             [John, Tom]   
    Richard                   [James, John, Tom]   
    Tom      [Frank, James, John, Paul, Richard]   
    
                                            Times  
    Frank                        [0.333333333333]  
    James                         [0.2, 0.2, 0.2]  
    John                     [0.2, 0.5, 0.2, 0.7]  
    Paul                               [0.5, 0.5]  
    Richard                       [0.2, 0.2, 0.2]  
    Tom      [0.333333333333, 0.2, 0.7, 0.5, 0.2]
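Pieced together (and updated to Python 3), the whole weighted computation is short. The final loop can also be replaced by a row-wise `apply()`; here is a sketch of the end-to-end pipeline under that variation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# pair every name with its co-occurring names, drop self-pairs
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# 1/weight per shared item, summed over all items a pair shares,
# then pivoted to a wide Name_x x Name_y matrix
collab = (pairs.assign(inv=1 / pairs['Weight_x'])
               .groupby(['Name_x', 'Name_y'])['inv'].sum()
               .unstack())

# row-wise apply() instead of the explicit iterrows() loop
out = pd.DataFrame({
    'People': collab.apply(lambda r: r.dropna().index.tolist(), axis=1),
    'Times': collab.apply(lambda r: r.dropna().tolist(), axis=1),
})
```

`assign(inv=...)` computes the reciprocal weight once before grouping, which avoids relying on `1/x` inside `apply` and keeps each step a plain vectorized operation.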