Search code examples
pythonpandasdataframecalculated-columnsweighted-average

How to compute weighted sum of all elements in a row in pandas?


I have a pandas data frame with multiple columns. I want to create a new column weighted_sum from the values in the row and another column vector dataframe weight

weighted_sum should have the following value:

row[weighted_sum] = row[col0]*weight[0] + row[col1]*weight[1] + row[col2]*weight[2] + ...

I found the function sum(axis=1), but it doesn't let me multiply with weight.

Edit: I changed things a bit.

weight looks like this:

     0
col1 0.5
col2 0.3
col3 0.2

df looks like this:

col1 col2 col3
1.0  2.2  3.5
6.1  0.4  1.2

df*weight returns a dataframe full of Nan values.


Solution

  • The problem is that you're multiplying a frame with a frame of a different size with a different row index. Here's the solution:

    In [121]: df = DataFrame([[1,2.2,3.5],[6.1,0.4,1.2]], columns=list('abc'))
    
    In [122]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))
    
    In [123]: df
    Out[123]:
               a          b          c
    0       1.00       2.20       3.50
    1       6.10       0.40       1.20
    
    In [124]: weight
    Out[124]:
               0
    a       0.50
    b       0.30
    c       0.20
    
    In [125]: df * weight
    Out[125]:
               0          a          b          c
    0        nan        nan        nan        nan
    1        nan        nan        nan        nan
    a        nan        nan        nan        nan
    b        nan        nan        nan        nan
    c        nan        nan        nan        nan
    

    You can either access the column:

    In [126]: df * weight[0]
    Out[126]:
               a          b          c
    0       0.50       0.66       0.70
    1       3.05       0.12       0.24
    
    In [128]: (df * weight[0]).sum(1)
    Out[128]:
    0         1.86
    1         3.41
    dtype: float64
    

    Or use dot to get back another DataFrame

    In [127]: df.dot(weight)
    Out[127]:
               0
    0       1.86
    1       3.41
    

    To bring it all together:

    In [130]: df['weighted_sum'] = df.dot(weight)
    
    In [131]: df
    Out[131]:
               a          b          c  weighted_sum
    0       1.00       2.20       3.50          1.86
    1       6.10       0.40       1.20          3.41
    

    Here are the timeits of each method, using a larger DataFrame.

    In [145]: df = DataFrame(randn(10000000, 3), columns=list('abc'))
    weight
    In [146]: weight = DataFrame(Series([0.5, 0.3, 0.2], index=list('abc'), name=0))
    
    In [147]: timeit df.dot(weight)
    10 loops, best of 3: 57.5 ms per loop
    
    In [148]: timeit (df * weight[0]).sum(1)
    10 loops, best of 3: 125 ms per loop
    

    For a wide DataFrame:

    In [162]: df = DataFrame(randn(10000, 1000))
    
    In [163]: weight = DataFrame(randn(1000, 1))
    
    In [164]: timeit df.dot(weight)
    100 loops, best of 3: 5.14 ms per loop
    
    In [165]: timeit (df * weight[0]).sum(1)
    10 loops, best of 3: 41.8 ms per loop
    

    So, dot is faster and more readable.

    NOTE: If any of your data contain NaNs then you should not use dot you should use the multiply-and-sum method. dot cannot handle NaNs since it is just a thin wrapper around numpy.dot() (which doesn't handle NaNs).