python pandas mean median outliers

dataframe mean calculation -> values that differ >20% from the median should be excluded from the mean-computation


I want to calculate the row-wise mean of columns y_2010, y_2011, y_2012, y_2013, y_2014 of the dataframe (energy use data per year), however:

  • values that differ by more than 20% from the median (of the five values) should be excluded from the mean computation.
  • if fewer than two values remain in a row (after the condition above), the mean is set to NaN, as one value is not enough for a reliable mean -> so the mean can only be calculated for rows that contain two or more values after the '20% difference condition' above. (See ID 36: one value remains after the first condition, but that's not enough for a reliable mean, so it's set to NaN.)

Calculating the mean of 5 columns is easy, but I'm stuck at defining the condition 'if median*0.8 <= value in the data row <= median*1.2, then mean == mean of the values within those boundaries, provided 2 or more such values are present'.

So I'm trying to calculate each row's mean from only the values that are not 'outliers'.

Initial df:

ID  y_2010   y_2011   y_2012  y_2013  y_2014
23   22631  21954.0  22314.0   22032   21843
43   27456  29654.0  28159.0   28654    2000
36   61200      NaN      NaN   31895    1600
87   87621  86542.0  87542.0   88456   86961
90   58951  57486.0   2000.0       0       0
98   24587  25478.0      NaN   24896   25461

Desired df:

   ID  y_2010   y_2011   y_2012  y_2013  y_2014      mean
0  23   22631  21954.0  22314.0   22032   21843   22154.8
1  43   27456  29654.0  28159.0   28654    2000  28480.75
2  36   61200      NaN      NaN   31895    1600       NaN
3  87   87621  86542.0  87542.0   88456   86961   87424.4
4  90   58951  57486.0   2000.0       0       0       NaN
5  98   24587  25478.0      NaN   24896   25461   25105.5
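
To make the rule concrete, here is a minimal sketch for a single row (ID 43 from the tables above) using plain NumPy. The variable names are illustrative only, and the 20% band is assumed to be inclusive at both ends.

```python
import numpy as np

# Row ID 43 from the tables above
values = np.array([27456, 29654.0, 28159.0, 28654, 2000])

median = np.nanmedian(values)                  # median of the non-NaN values
lower, upper = 0.8 * median, 1.2 * median      # 20% band around the median
kept = values[(values >= lower) & (values <= upper)]

# Require at least two surviving values, otherwise report NaN
row_mean = kept.mean() if kept.size >= 2 else np.nan
print(median, kept.size, row_mean)
```

Here the value 2000 falls outside the band, four values remain, and their mean matches the 28480.75 shown in the desired df.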

Code tried so far (I'm stuck at getting the conditions right and applying them to the dataframe):

import pandas as pd
import numpy as np

df = pd.DataFrame({"ID": [23,43,36,87,90,98],
               "y_2010": [22631,27456,61200,87621,58951,24587], 
               "y_2011": [21954,29654,np.nan,86542,57486,25478],  
               "y_2012": [22314,28159,np.nan,87542,2000,np.nan],  
               "y_2013": [22032,28654,31895,88456,0,24896,],
               "y_2014": [21843,2000,1600,86961,0,25461]})
print(df)

a = df.loc[:, ['y_2010','y_2011','y_2012','y_2013', 'y_2014']]

# calculate median
median = a.median(1)
print(median)

# where condition is violated
mask = a.lt(median*.8, axis=0) | a.gt(median*1.2, axis=0)




Solution

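  • I think your mask is right. From there you can try this:

    # mean of the values that survive the 20% filter (NaN and outliers are skipped)
    col_mean = a[~mask].mean(axis=1)
    # keep the mean only where at least two values survive the filter
    enough = a[~mask].count(axis=1) >= 2

    a["mean"] = col_mean.where(enough, other=np.nan)
    print(a)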
    

    Output:

       y_2010   y_2011   y_2012  y_2013  y_2014      mean
    0   22631  21954.0  22314.0   22032   21843  22154.80
    1   27456  29654.0  28159.0   28654    2000  28480.75
    2   61200      NaN      NaN   31895    1600       NaN
    3   87621  86542.0  87542.0   88456   86961  87424.40
    4   58951  57486.0   2000.0       0       0       NaN
    5   24587  25478.0      NaN   24896   25461  25105.50
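
A self-contained variant of the same idea, as a sketch rather than a definitive implementation: it counts the values that survive the filter directly, so the "at least two values" rule does not depend on how many values were flagged or missing. The helper name `robust_row_mean` and its parameters `tol` and `min_count` are hypothetical, not part of pandas, and the sketch assumes non-negative values (for a negative median the band bounds would swap).

```python
import numpy as np
import pandas as pd

def robust_row_mean(frame, tol=0.2, min_count=2):
    """Row-wise mean of the values within +/- tol of the row median.

    Rows with fewer than min_count surviving values get NaN.
    (Hypothetical helper; name and parameters are illustrative.)
    """
    median = frame.median(axis=1)
    within = (frame.ge(median * (1 - tol), axis=0)
              & frame.le(median * (1 + tol), axis=0))
    kept = frame.where(within)  # outliers and missing cells both become NaN
    return kept.mean(axis=1).where(kept.count(axis=1) >= min_count)

df = pd.DataFrame({"ID": [23, 43, 36, 87, 90, 98],
                   "y_2010": [22631, 27456, 61200, 87621, 58951, 24587],
                   "y_2011": [21954, 29654, np.nan, 86542, 57486, 25478],
                   "y_2012": [22314, 28159, np.nan, 87542, 2000, np.nan],
                   "y_2013": [22032, 28654, 31895, 88456, 0, 24896],
                   "y_2014": [21843, 2000, 1600, 86961, 0, 25461]})

year_cols = ["y_2010", "y_2011", "y_2012", "y_2013", "y_2014"]
df["mean"] = robust_row_mean(df[year_cols])
print(df)

# Edge cases not covered by the sample data:
# row 0 has two outliers but three surviving values, so it keeps a mean;
# row 1 has a single value and no outliers, so it gets NaN.
edge = pd.DataFrame({"a": [100, 50], "b": [100, np.nan], "c": [100, np.nan],
                     "d": [10, np.nan], "e": [1000, np.nan]})
edge_mean = robust_row_mean(edge)
print(edge_mean)
```

Assigning the result to `df["mean"]` instead of to the column subset keeps the ID column next to the means, as in the desired df.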