Search code examples
python-3.xpandasdataframemedian

calculate the median by excluding outliers (+-25% of median of four columns) for values of four columns values


I have four columns containing the amount of unit sold in last 4 years. I want to calculate the median and then calculate the mean by excluding the column values based on condition (+-25% of median).

df = df[['Date','ID','amount']] # df has daily data

df['dayofYear'] = df['Date'].dt.dayofyear

df['Year_Lag1']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(1))

df['Year_Lag2']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(2))

df['Year_Lag3']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(3))

df['Year_Lag4']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(4))


df['YearLag_median']= df[['Year_Lag1','Year_Lag2','Year_Lag3','Year_Lag4']].median(axis=1) #median amount on same date (is there a better way to calculate median by avoiding outliers)

how to calculate the average by avoiding values in four columns that lies in +- 25% of median.

suppose Year_Lag1 = 5000, Year_Lag2= 230, Year_Lag3=4500, Year_Lag4= 4300.

how to calculate average by avoiding Year_Lag2 value.

I want this to be done for all rows in a dataframe.

(also if someone can help with a better way to calculate median by avoiding outliers) Data set[data has values for 2nd and 3rd january (2014,15,16,17,18). Year_Lag1(shift(1)) has values from previous year for 2nd and 3rd jan. Year_Lag2(shift(2) has values from last to last year) so on..]

last row is the example where I want to ignore 589.0 to calculate the mean. [1]: https://i.sstatic.net/26Dvp.pngenter code here


Solution

  • Here is a solution, I think there should be something better to do it, but still it works:

    1. Define a function to calculate the mean according to your specifications
    def calculateMean(row):
        s = 0
        n = 0
        for i in range(4):
            if ~np.isnan(row[i]) and abs(row[i] - row[-1]) < 0.25 * row[-1]:
                s += row[i]
                n += 1
        return (s/n if n else np.nan)
    
    1. Apply this function on every rows
    df["YearLag_mean"] = df.loc[:, ['Year_Lag1','Year_Lag2','Year_Lag3','Year_Lag4', 'YearLag_median']]\
                           .apply(lambda row: calculateMean(row), axis=1)
    

    Output:

                Date   ID  amount  dayofYear  Year_Lag1  Year_Lag2  Year_Lag3  Year_Lag4  YearLag_median  YearLag_mean
    2258  2014-01-02  200  1778.0          2        NaN        NaN        NaN        NaN             NaN           NaN
    2259  2014-01-03  200  2149.0          3        NaN        NaN        NaN        NaN             NaN           NaN
    2623  2015-01-02  200  2057.0          2     1778.0        NaN        NaN        NaN          1778.0       1778.00
    2624  2015-01-03  200  2401.0          3     2149.0        NaN        NaN        NaN          2149.0       2149.00
    2988  2016-01-02  200  2315.0          2     2057.0     1778.0        NaN        NaN          1917.5       1917.50
    2989  2016-01-03  200   589.0          3     2401.0     2149.0        NaN        NaN          2275.0       2275.00
    3354  2017-01-02  200  1709.0          2     2315.0     2057.0     1778.0        NaN          2057.0       2050.00
    3355  2017-01-03  200  1659.0          3      589.0     2401.0     2149.0        NaN          2149.0       2275.00
    3719  2018-01-02  200  1991.0          2     1709.0     2315.0     2057.0     1778.0          1917.5       1964.75
    3720  2018-01-03  200  1570.0          3     1659.0      589.0     2401.0     2149.0          1904.0       1904.00
    

    As you can see, the last row didn't use 589 but also 2401 because of your threshold.

    To remove outliers instead of using the median, you can look at IQR or Z-score, but I'm not sure it works well on small data, you can give it try and just adapt or create new functions.