I have a Pandas DataFrame df1 like this:
ID col1 col2 col3
0 A 1 1 1
1 B 1 1 1
2 C 1 0 1
3 D 0 0 0
4 E 1 1 1
5 F 0 1 0
6 G 1 0 0
7 H 0 1 0
and another DataFrame df2 like this:
ID col1 col2 col3 rating1 rating2
0 I 0 1 0 1 2
1 J 0 1 1 3 1
2 K 1 0 0 4 4
3 L 0 1 0 2 2
and I need to find the similarity, (x == y).sum() (i.e. the count of matching elements, not a bitwise XOR), between each row of df1 (excluding ID) and every row of df2, and then use those similarities as weights to compute weighted-average ratings for all the IDs in df1.
For example, the final rating1 of A is computed as follows: A's columns (1, 1, 1) match I in 1 position, J in 2, K in 1 and L in 1, so the weights are 1, 2, 1, 1 and
rating1[A] = ( 1*1 + 2*3 + 1*4 + 1*2 ) / (1 + 2 + 1 + 1) = 13/5
My output should be another dataframe with the ratings for all the IDs in df1, computed as these similarity-weighted means of the ratings in df2.
Output Dataframe:
ID rating1 rating2
0 A 13/5 2
1 B 13/5 2
... and so on through ID H
Please help me do this efficiently using pandas functions. I tried iterating over all the rows (roughly along the lines of the sketch below), but both df1 and df2 are very large tables and it takes a lot of time.
Thanks in advance.
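For context, a minimal reconstruction of the slow row-by-row approach, assuming a double iterrows loop (this is a sketch, not my exact code):

import pandas as pd
import numpy as np

rows = []
for _, r1 in df1.iterrows():
    # similarity of this df1 row to every df2 row (count of equal columns)
    w = np.array([(r1.iloc[1:4] == r2.iloc[1:4]).sum() for _, r2 in df2.iterrows()])
    rows.append([r1['ID'],
                 (w * df2['rating1']).sum() / w.sum(),
                 (w * df2['rating2']).sum() / w.sum()])
out = pd.DataFrame(rows, columns=['ID', 'rating1', 'rating2'])

The pairwise work all happens in Python-level loops, which is why it does not scale.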
Edit: @WeNYoBen's answer is correct. However, in my case the dimensions of df1 and df2 are much larger than in this example. In the third line of the answer, s = np.sum(s1[:, None] == s2, -1), s has shape len(df1) x len(df2); both frames are huge in my case, so that line raises a MemoryError. Is there a workaround, in the sense that I could split df2 into smaller parts and compute on them individually?
Thanks again.
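Both s.dot(df2.rating1) and s.sum(1) decompose additively over column blocks of s, so processing df2 in slices and accumulating the numerators and weight totals should give the same result without ever materializing the full matrix. A minimal sketch of that idea, assuming the column layout above (the chunk size of 10000 is an arbitrary placeholder to tune to available memory):

import numpy as np

s1 = df1.iloc[:, 1:].values            # (len(df1), 3) feature rows of df1

num1 = np.zeros(len(df1))              # running sum of weight * rating1
num2 = np.zeros(len(df1))              # running sum of weight * rating2
den = np.zeros(len(df1))               # running sum of weights

chunk = 10000                          # placeholder; tune to memory
for start in range(0, len(df2), chunk):
    part = df2.iloc[start:start + chunk]
    s2 = part.iloc[:, 1:-2].values     # feature rows of this slice
    s = np.sum(s1[:, None] == s2, -1)  # (len(df1), len(part)) similarity block
    num1 += s.dot(part['rating1'].values)
    num2 += s.dot(part['rating2'].values)
    den += s.sum(1)

df1['rating1'] = num1 / den
df1['rating2'] = num2 / den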
This is more like a numpy broadcast, then dot:
import numpy as np

s1 = df1.iloc[:, 1:].values        # (8, 3) rows of df1 without the ID column
s2 = df2.iloc[:, 1:-2].values      # (4, 3) rows of df2 without ID and the rating columns
s = np.sum(s1[:, None] == s2, -1)  # (8, 4) count of matching columns per pair of rows
df1['rating1'] = s.dot(df2.rating1) / s.sum(1)  # similarity-weighted average of rating1
df1['rating2'] = s.dot(df2.rating2) / s.sum(1)
df1
Out[623]:
ID col1 col2 col3 rating1 rating2
0 A 1 1 1 2.600000 2.000000
1 B 1 1 1 2.600000 2.000000
2 C 1 0 1 3.666667 3.000000
3 D 0 0 0 2.428571 2.428571
4 E 1 1 1 2.600000 2.000000
5 F 0 1 0 2.111111 2.000000
6 G 1 0 0 3.000000 3.200000
7 H 0 1 0 2.111111 2.000000
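For reference on the broadcast: s1[:, None] inserts a new axis, giving shape (8, 1, 3); comparing it with s2 of shape (4, 3) broadcasts to a boolean array of shape (8, 4, 3), and summing over the last axis yields the (8, 4) matrix s whose (i, j) entry counts the columns on which row i of df1 agrees with row j of df2.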