I'm trying to calculate the average of each column (series) in a dataframe without the outliers. I used seaborn's boxplot for this task:
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(50, 10), dpi=200)
sns.boxplot(x='Unit_Code', y='Leadtime', hue='Has_Weekend?', data=df, palette='winter')
plt.xticks(rotation=90);
And that's what I got:
I would actually love to get the mean of each unit (x-axis) without the outliers. The rationale behind this, and correct me if I'm wrong, is that the outliers skew the average of this feature, so I'd like to exclude them.
Thanks!
Removing outliers can be done in a number of ways. This example uses the z-score method for removing the outliers.
Once the outliers are removed, calculating the mean is as simple as calling the .mean() function on each column of the DataFrame, or using the .describe() function.
Without going into too much detail, the z-score measures how many standard deviations a value is from the mean. It's very simple really: each value, minus the mean, divided by the standard deviation of the dataset. Generally speaking, with normally distributed data which sticks close to the mean, a z-score of 3 can be used as a filter, which is demonstrated in the case below.
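As a minimal sketch of that definition, the z-score can be computed by hand and compared with scipy.stats.zscore(). The values array here is made up for illustration (200 ordinary readings plus one extreme value) and is not part of the dataset used in this answer.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 200 ordinary readings plus one extreme value.
rng = np.random.default_rng(0)
values = np.append(rng.normal(50, 5, 200), 150.0)

# z-score: (value - mean) / standard deviation.
# np.std() defaults to the population standard deviation (ddof=0),
# which is also the default used by scipy.stats.zscore().
z_manual = (values - values.mean()) / values.std()
z_scipy = stats.zscore(values)

print(np.allclose(z_manual, z_scipy))

# Values at least 3 standard deviations away from the mean.
outliers = values[np.abs(z_manual) >= 3]
print(outliers)
```

Note that pandas' .std() defaults to the sample standard deviation (ddof=1), so a hand-rolled pandas version gives slightly different z-scores unless ddof=0 is passed.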
This article might be of interest, regarding the detection and removal of outliers.
An easy method for calculating the z-score may be to use the scipy.stats module, with the docs referenced here.
For this example, I've synthesised a dataset, which can be found at the bottom of this answer. Additionally, as I'm more familiar with plotly than seaborn, I've chosen to use plotly for plotting.
Let's get on with it ...
This example code is irrelevant to the issue, just plotting code.
l = {'title': 'Boxplot - With Outliers'}
t = []
t.append({'y': df['AZGD01'], 'type': 'box', 'name': 'AZGD01'})
t.append({'y': df['AZPH01'], 'type': 'box', 'name': 'AZPH01'})
t.append({'y': df['AZPV01'], 'type': 'box', 'name': 'AZPV01'})
iplot({'data': t, 'layout': l})
Output:
This shows an example of how the z-score can be calculated on each column of a DataFrame, where the filtered values are stored to a second DataFrame.
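Steps: iterate over each column of the DataFrame, calculate its z-score using the scipy.stats.zscore() function, then filter out the records whose z-score exceeds the threshold of 3.
Example: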
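from scipy import stats
# Start from the original index so each column is filtered independently;
# removed values are left as NaN.
df_z = pd.DataFrame(index=df.index)
for c in df:
    # Calculate the z-score for each column.
    z = stats.zscore(df[c])
    # Keep records within 3 standard deviations of the mean (|z| < 3),
    # so outliers on either side are removed.
    df_z[f'{c}_z'] = df.loc[abs(z) < 3, c]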
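To get to the actual question, the mean of each column without the outliers, here is a rough sketch. It rebuilds a small stand-in dataset so it runs on its own; the helper name mean_without_outliers and the threshold parameter are my own for illustration, not anything from pandas or scipy.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in data: three columns of ordinary readings,
# each with a handful of large outliers appended.
rng = np.random.default_rng(7)
df = pd.DataFrame({
    name: np.concatenate([rng.normal(50, 10, 1000), outliers])
    for name, outliers in [
        ('AZGD01', np.arange(150, 200, 10)),
        ('AZPH01', np.arange(200, 250, 10)),
        ('AZPV01', np.arange(250, 300, 10)),
    ]
})

def mean_without_outliers(col, threshold=3.0):
    """Mean of a column after dropping values with |z-score| >= threshold."""
    z = stats.zscore(col.to_numpy())
    return col[np.abs(z) < threshold].mean()

# Compare the raw mean with the outlier-free mean, per column.
summary = pd.DataFrame({
    'mean_raw': df.mean(),
    'mean_filtered': df.apply(mean_without_outliers),
})
print(summary)
```

If the filtered values are already stored in a DataFrame with the removed values left as NaN, calling .mean() on it works too, because .mean() skips NaN values by default.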
Again, just irrelevant plotting code - but notice the second (filtered) DataFrame is used for the plots.
l = {'title': 'Boxplot - Outliers (> 3 std) Removed'}
t = []
t.append({'y': df_z['AZGD01_z'], 'type': 'box', 'name': 'AZGD01'})
t.append({'y': df_z['AZPH01_z'], 'type': 'box', 'name': 'AZPH01'})
t.append({'y': df_z['AZPV01_z'], 'type': 'box', 'name': 'AZPV01'})
iplot({'data': t, 'layout': l})
Output:
Below is more irrelevant code, which was used to construct the sample dataset.
import numpy as np
import pandas as pd
from plotly.offline import iplot
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler((0, 100))
np.random.seed(7)
vals1 = mms.fit_transform(np.random.randn(1000).reshape(-1, 1)).ravel()
np.random.seed(3)
vals2 = mms.fit_transform(np.random.randn(1000).reshape(-1, 1)).ravel()
np.random.seed(73)
vals3 = mms.fit_transform(np.random.randn(1000).reshape(-1, 1)).ravel()
outl1 = np.arange(150, 200, 10)
outl2 = np.arange(200, 250, 10)
outl3 = np.arange(250, 300, 10)
data1 = np.concatenate([vals1, outl1])
data2 = np.concatenate([vals2, outl2])
data3 = np.concatenate([vals3, outl3])
df = pd.DataFrame({'AZGD01': data1, 'AZPH01': data2, 'AZPV01': data3})
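The dataset above is in wide format (one column per unit), whereas the data in the question appears to be in long format, with a Unit_Code column and a Leadtime column. A possible sketch of the same z-score filter applied per group is shown below. The data is invented, only the column names come from the question, and mean_without_outliers is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data using the question's column names:
# two units, each with 200 ordinary lead times and one extreme value.
rng = np.random.default_rng(0)
df_long = pd.DataFrame({
    'Unit_Code': np.repeat(['A', 'B'], 201),
    'Leadtime': np.concatenate([
        np.append(rng.normal(10, 2, 200), 100.0),
        np.append(rng.normal(20, 2, 200), 150.0),
    ]),
})

def mean_without_outliers(s, threshold=3.0):
    """Mean of a Series after dropping values with |z-score| >= threshold."""
    # ddof=0 matches the default of scipy.stats.zscore().
    z = (s - s.mean()) / s.std(ddof=0)
    return s[z.abs() < threshold].mean()

# The z-score is calculated within each unit, not across the whole column.
means = df_long.groupby('Unit_Code')['Leadtime'].apply(mean_without_outliers)
print(means)
```

To mirror the hue used in the boxplot, the grouping could be extended to ['Unit_Code', 'Has_Weekend?'], assuming that column exists in the DataFrame.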