Search code examples
pythonpandasdataframevectormatrix-multiplication

Pandas - Take bitwise similarly and calculate weighted mean based on that


I have a Pandas dataframe df1 like this:

  ID  col1 col2 col3
0  A   1    1    1
1  B   1    1    1
2  C   1    0    1
3  D   0    0    0
4  E   1    1    1
5  F   0    1    0
6  G   1    0    0
7  H   0    1    0

and another dataframe df2 like this:

  ID  col1 col2 col3 rating1  rating2
0  I   0    1    0     1        2
1  J   0    1    1     3        1
2  K   1    0    0     4        4
3  L   0    1    0     2        2

and I need to find the similarity (x==y).sum (not of bitwise xor) between each row of df1 (except ID) with the corresponding rows in df2 and use that similarity as a weight to compute average ratings for all the IDs in df1.

For example:

Final ratings of A should be computed as:

rating1[A] = ( 1*1 + 2*3 + 1*4 + 1*2 ) / (1 + 2 + 1 + 1) = 13/5

My output should be another dataframe having the ratings for all the IDs in df1 after computing the similarity mean of ratings using df2.

Output Dataframe:

  ID rating1  rating2
0  A   13/5      2
1  B   13/5      2
... and so on till ID H

Please help me do this efficiently using pandas functions. I tried doing it by iterating over all rows, but both df1 and df2 are very huge tables and it is taking a lot of time.

Thanks in advance.


Update

@WeNYoBen's answer is correct. Although, in my case, the dimensions of df1 and df2 are much higher than what I've shown here as an example.

In the third line (of @WeNYoBen's answer), the dimensions of s would be len(df1)*len(df2) and both of them are huge in my case and give me a MemoryError. Is there any workaround for that, in the sense that I could split df2 into smaller parts and can compute on them individually?

Thanks again.


Solution

  • This is more like numpy broadcast then dot

    s1=df1.iloc[:,1:].values
    s2=df2.iloc[:,1:-2].values
    s=np.sum(s1[:,None]==s2,-1)
    df1['rating1']=s.dot(df2.rating1)/s.sum(1)
    df1['rating2']=s.dot(df2.rating2)/s.sum(1)
    df1
    Out[623]: 
      ID  col1  col2  col3   rating1   rating2
    0  A     1     1     1  2.600000  2.000000
    1  B     1     1     1  2.600000  2.000000
    2  C     1     0     1  3.666667  3.000000
    3  D     0     0     0  2.428571  2.428571
    4  E     1     1     1  2.600000  2.000000
    5  F     0     1     0  2.111111  2.000000
    6  G     1     0     0  3.000000  3.200000
    7  H     0     1     0  2.111111  2.000000