I have four columns containing the amount of unit sold in last 4 years. I want to calculate the median and then calculate the mean by excluding the column values based on condition (+-25% of median).
df = df[['Date','ID','amount']] # df has daily data
df['dayofYear'] = df['Date'].dt.dayofyear
df['Year_Lag1']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(1))
df['Year_Lag2']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(2))
df['Year_Lag3']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(3))
df['Year_Lag4']= df.groupby(['ID','dayofYear'])['amount'].transform(lambda x: x.shift(4))
df['YearLag_median']= df[['Year_Lag1','Year_Lag2','Year_Lag3','Year_Lag4']].median(axis=1) #median amount on same date (is there a better way to calculate median by avoiding outliers)
how to calculate the average by avoiding values in four columns that lies in +- 25% of median.
suppose Year_Lag1 = 5000, Year_Lag2= 230, Year_Lag3=4500, Year_Lag4= 4300
.
how to calculate average by avoiding Year_Lag2
value.
I want this to be done for all rows in a dataframe.
(also if someone can help with a better way to calculate median by avoiding outliers) Data set[data has values for 2nd and 3rd january (2014,15,16,17,18). Year_Lag1(shift(1)) has values from previous year for 2nd and 3rd jan. Year_Lag2(shift(2) has values from last to last year) so on..]
last row is the example where I want to ignore 589.0 to calculate the mean.
[1]: https://i.sstatic.net/26Dvp.pngenter code here
Here is a solution, I think there should be something better to do it, but still it works:
def calculateMean(row):
s = 0
n = 0
for i in range(4):
if ~np.isnan(row[i]) and abs(row[i] - row[-1]) < 0.25 * row[-1]:
s += row[i]
n += 1
return (s/n if n else np.nan)
df["YearLag_mean"] = df.loc[:, ['Year_Lag1','Year_Lag2','Year_Lag3','Year_Lag4', 'YearLag_median']]\
.apply(lambda row: calculateMean(row), axis=1)
Output:
Date ID amount dayofYear Year_Lag1 Year_Lag2 Year_Lag3 Year_Lag4 YearLag_median YearLag_mean
2258 2014-01-02 200 1778.0 2 NaN NaN NaN NaN NaN NaN
2259 2014-01-03 200 2149.0 3 NaN NaN NaN NaN NaN NaN
2623 2015-01-02 200 2057.0 2 1778.0 NaN NaN NaN 1778.0 1778.00
2624 2015-01-03 200 2401.0 3 2149.0 NaN NaN NaN 2149.0 2149.00
2988 2016-01-02 200 2315.0 2 2057.0 1778.0 NaN NaN 1917.5 1917.50
2989 2016-01-03 200 589.0 3 2401.0 2149.0 NaN NaN 2275.0 2275.00
3354 2017-01-02 200 1709.0 2 2315.0 2057.0 1778.0 NaN 2057.0 2050.00
3355 2017-01-03 200 1659.0 3 589.0 2401.0 2149.0 NaN 2149.0 2275.00
3719 2018-01-02 200 1991.0 2 1709.0 2315.0 2057.0 1778.0 1917.5 1964.75
3720 2018-01-03 200 1570.0 3 1659.0 589.0 2401.0 2149.0 1904.0 1904.00
As you can see, the last row didn't use 589
but also 2401
because of your threshold.
To remove outliers instead of using the median, you can look at IQR or Z-score, but I'm not sure it works well on small data, you can give it try and just adapt or create new functions.