I have a dataframe like this
import numpy as np
import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [4, 4, 4, 3, 3, 5, 5, 5, 5]})
df
Item Name Weight
A Tom 4
A John 4
A Paul 4
B Tom 3
B Frank 3
C Tom 5
C John 5
C Richard 5
C James 5
For each person, I want the list of people who share an item with them, where each shared item contributes 1/Weight (summed over all shared items):
df1
Name People Times
Tom [John, Paul, Frank, Richard, James] [(1/4+1/5),1/4,1/3,1/5,1/5]
John [Tom, Paul, Richard, James] [(1/4+1/5),1/4,1/5,1/5]
Paul [Tom, John] [1/4,1/4]
Frank [Tom] [1/3]
Richard [Tom, John, James] [1/5,1/5,1/5]
James [Tom, John, Richard] [1/5,1/5,1/5]
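To make the weighting explicit: each shared item contributes 1/Weight of that item, summed per pair. A quick check in plain Python, using the weights from the table above:

```python
# Tom and John share item A (Weight 4) and item C (Weight 5),
# so their weighted collaboration count is:
tom_john = 1 / 4 + 1 / 5
print(tom_john)  # 1/4 + 1/5 = 0.45
```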
To count the number of collaborations without considering the weight, I did:
# merge M:N on column Item
df1 = pd.merge(df, df, on=['Item'])
# drop self-pairs where Name_x == Name_y
df1 = df1[df1['Name_x'] != df1['Name_y']]
# collect collaborators into lists
df1 = df1.groupby('Name_x')['Name_y'].apply(lambda x: x.tolist()).reset_index()
print(df1)
Name_x Name_y
0 Frank [Tom]
1 James [Tom, John, Richard]
2 John [Tom, Paul, Tom, Richard, James]
3 Paul [Tom, John]
4 Richard [Tom, John, James]
5 Tom [John, Paul, Frank, John, Richard, James]
# get counts with np.unique
df1['People'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[0])
df1['times'] = df1['Name_y'].apply(lambda a: np.unique(a, return_counts=True)[1])
# drop column Name_y
df1 = df1.drop('Name_y', axis=1).rename(columns={'Name_x': 'Name'})
print(df1)
Name People times
0 Frank [Tom] [1]
1 James [John, Richard, Tom] [1, 1, 1]
2 John [James, Paul, Richard, Tom] [1, 1, 1, 2]
3 Paul [John, Tom] [1, 1]
4 Richard [James, John, Tom] [1, 1, 1]
5 Tom [Frank, James, John, Paul, Richard] [1, 1, 2, 1, 1]
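As a sanity check, the same unweighted counts can be reproduced without the self-join, by pairing names within each item group via itertools.permutations (a sketch using the names from the data above; weights are irrelevant here):

```python
from collections import Counter, defaultdict
from itertools import permutations

import pandas as pd

df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James']})

# Every ordered pair of names within an item counts as one collaboration.
counts = defaultdict(Counter)
for _, names in df.groupby('Item')['Name']:
    for a, b in permutations(names, 2):
        counts[a][b] += 1

print(counts['Tom'])  # Tom pairs with John twice (items A and C)
```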
df1 now holds the count of collaborations between all pairs; however, I would like the weighted count of collaborations instead.
Starting with:
df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})
df1 = pd.merge(df, df, on=['Item'])
df1 = df1[~(df1['Name_x'] == df1['Name_y'])].set_index(['Name_x', 'Name_y']).drop(['Item', 'Weight_y'], axis=1)
You could use .apply() to create the values and .unstack() for the wide format:
collab = df1.groupby(level=['Name_x', 'Name_y']).apply(lambda x: np.sum(1/x)).unstack().loc[:, 'Weight_x']
Name_y Frank James John Paul Richard Tom
Name_x
Frank NaN NaN NaN NaN NaN 0.333333
James NaN NaN 0.2 NaN 0.2 0.200000
John NaN 0.2 NaN 0.5 0.2 0.700000
Paul NaN NaN 0.5 NaN NaN 0.500000
Richard NaN 0.2 0.2 NaN NaN 0.200000
Tom 0.333333 0.2 0.7 0.5 0.2 NaN
then iterate over the rows and convert to lists:
df = pd.DataFrame(columns=['People', 'Times'])
for p, data in collab.iterrows():
    s = data.dropna()
    df.loc[p] = [s.index.tolist(), s.values]
People \
Frank [Tom]
James [John, Richard, Tom]
John [James, Paul, Richard, Tom]
Paul [John, Tom]
Richard [James, John, Tom]
Tom [Frank, James, John, Paul, Richard]
Times
Frank [0.333333333333]
James [0.2, 0.2, 0.2]
John [0.2, 0.5, 0.2, 0.7]
Paul [0.5, 0.5]
Richard [0.2, 0.2, 0.2]
Tom [0.333333333333, 0.2, 0.7, 0.5, 0.2]
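The iterrows() loop can also be avoided entirely by staying in long format and aggregating into lists with named aggregation (a sketch on the answer's starting data with weights 2/3/5; the agg(People=...) syntax needs pandas 0.25 or later):

```python
import pandas as pd

# Same starting data as the answer's snippet.
df = pd.DataFrame({'Item': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C', 'C'],
                   'Name': ['Tom', 'John', 'Paul', 'Tom', 'Frank', 'Tom', 'John', 'Richard', 'James'],
                   'Weight': [2, 2, 2, 3, 3, 5, 5, 5, 5]})

# Self-join on Item and drop self-pairs.
pairs = pd.merge(df, df, on='Item')
pairs = pairs[pairs['Name_x'] != pairs['Name_y']]

# Sum 1/Weight per ordered pair, then collect partners and values into lists.
weighted = (pairs.assign(inv=1 / pairs['Weight_x'])
                 .groupby(['Name_x', 'Name_y'])['inv']
                 .sum()
                 .reset_index())
result = (weighted.groupby('Name_x')
                  .agg(People=('Name_y', list), Times=('inv', list))
                  .rename_axis('Name'))
print(result)
```

This skips both the wide matrix and the per-row loop, at the cost of not having the matrix available for inspection.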