Search code examples
pythonpandasdataframe

Pandas - Merge DataFrame, keep non-null values on common columns, keep average on another column


I am working with two DataFrames, defined as such:

import pandas as pd
df1 = pd.DataFrame([[1, 'a', 0.95], [2, 'b', 0.92], [3, 'c',  0.91]], columns=['id','value','similarity'])

df2 = pd.DataFrame([[3, 'c', 0.93], [4, 'd', 0.92], [5, 'e',  0.99]], columns=['id','value','similarity'])

df1

id  name   similarity
1   a      0.95
2   b      0.92
3   c      0.91

df2

id  name   similarity
3   c      0.93
4   d      0.92
5   e      0.99

I now want to combine both DataFrames by the key attribute id, while avoiding the creation of NaNs in the name column. For this, I use pd.combine_first, as suggested in this answer, to get this:

df3 = df1.set_index('id').combine_first(df2.set_index('id'))
df3

id  name   similarity   
1   a      0.95
2   b      0.92
3   c      0.91
4   d      0.92
5   e      0.99

However, I have an additional requirement. When a similarity value exists on both sets, I want to take their average as the new value. So for instance my desired output would be something like:

id  name  similarity    
1   a      0.95
2   b      0.92
3   c      0.92 <-- (0.91 + 0.93 / 2)
4   d      0.92
5   e      0.99

Preferably using pandas or numpy, how can I average similarities when both DataFrames have a value?


Solution

  • You can concat, then groupby.mean:

    out = pd.concat([df1, df2]).groupby(['id', 'value'], as_index=False).mean()
    

    Output:

       id value  similarity
    0   1     a        0.95
    1   2     b        0.92
    2   3     c        0.92
    3   4     d        0.92
    4   5     e        0.99
    

    If you want to have the combine_first behavior for some columns and the average for others, you could use groupby.agg with first/mean depending on the columns:

    df1['x'] = 'one'
    df2['x'] = 'two'
    
    out = (pd.concat([df1, df2]).groupby(['id', 'value'], as_index=False)
             .agg({'similarity': 'mean', 'x': 'first'})
          )
    

    Output:

       id value  similarity    x
    0   1     a        0.95  one
    1   2     b        0.92  one
    2   3     c        0.92  one
    3   4     d        0.92  two
    4   5     e        0.99  two
    

    weighted average

    import numpy as np
    
    out = (pd.concat([df1.assign(w=0.7), df2.assign(w=0.3)])
             .groupby(['id', 'value'])
             .apply(lambda x: np.average(x['similarity'], weights=x['w']),
                    include_groups=False)
             .reset_index(name='weighted_avg')
          )
    

    Output:

       id value  weighted_avg
    0   1     a         0.950
    1   2     b         0.920
    2   3     c         0.916
    3   4     d         0.920
    4   5     e         0.990
    

    or:

    g = (pd.concat([df1.assign(w=0.7), df2.assign(w=0.3)])
           .eval('weighted_avg = similarity * w')
           .groupby(['id', 'value'])
        )
    out = (g.first()
            .assign(weighted_avg=g['weighted_avg'].sum()/g['w'].sum())
            .reset_index()
          )
    

    Output:

       id value  similarity    w  weighted_avg
    0   1     a        0.95  0.7         0.950
    1   2     b        0.92  0.7         0.920
    2   3     c        0.91  0.7         0.916
    3   4     d        0.92  0.3         0.920
    4   5     e        0.99  0.3         0.990