python dataframe seaborn boxplot outliers

Removing outliers to calculate mean of a series/dataframe via boxplots?

I'm trying to calculate the average of each column (series) in a dataframe without the outliers. I used seaborn's boxplot for this task:

plt.figure(figsize=(50, 10),dpi=200)
sns.boxplot(x='Unit_Code',y='Leadtime',hue='Has_Weekend?',data=df ,palette='winter')
plt.xticks(rotation=90);

And that's what I got:

plot1

I would actually love to get the mean of each unit(x axis) without the outliers. The rational behind this, and correct me if I'm wrong, is that I'd like to get the average of this feature, without the outliers, as they skew it.

Thanks!

Solution

Removing outliers can be done in a number of ways. This example uses the z-score method for removing the outliers.

Once the outliers are removed, calculating the mean is as simple as calling the .mean() function on each column of the DataFrame, or using the .describe() function.

Without going into too much detail, the z-score is a method to determine how many standard deviations a value is from the mean. It's very simple really, just each value, minus the mean, divided by the standard deviation of the dataset. Generally speaking, with normally distributed data which sticks close to the mean, a z-score of 3 can be used as a filter - which is demonstrated in the case below.

This article might be of interest, regarding the detection and removal of outliers.

An easy method for calculating the z-score may be to use the scipy.stats module, with the docs referenced here.

For this example, I've synthesised a dataset, which can be found at the bottom of this answer. Additionally, as I'm more familliar with plotly than seaborn, I've chosen to use plotly for plotting.

Let's get on with it ...

Before:

This example code is irrelevant to the issue, just plotting code.

l = {'title': 'Boxplot - With Outliers'}
t = []
t.append({'y': df['AZGD01'], 'type': 'box', 'name': 'AZGD01'})
t.append({'y': df['AZPH01'], 'type': 'box', 'name': 'AZPH01'})
t.append({'y': df['AZPV01'], 'type': 'box', 'name': 'AZPV01'})

iplot({'data': t, 'layout': l})

Output:

Filtering using z-score:

This shows an example of how the z-score can be calculated on each column of a DataFrame, where the filtered values are stored to a second DataFrame.

Steps:

Iterate each column
Calculate the z-score using the scipy.stats.zscore() function
Filter to keep only records with a z-score > 3
Store into a new DataFrame

Example:

from scipy import stats

df_z = pd.DataFrame()

for c in df:
    # Calculate z-score for each column.
    z = stats.zscore(df[c])
    # Filter to keep records with z-scores < 3.
    df_z[f'{c}_z'] = df.loc[z<3, c]

After:

Again, just irrelevant plotting code - but notice the second (filtered) DataFrame is used for the plots.

l = {'title': 'Boxlot - Outliers (> 3 std) Removed'}
t = []
t.append({'y': df_z['AZGD01_z'], 'type': 'box', 'name': 'AZGD01'})
t.append({'y': df_z['AZPH01_z'], 'type': 'box', 'name': 'AZPH01'})
t.append({'y': df_z['AZPV01_z'], 'type': 'box', 'name': 'AZPV01'})

iplot({'data': t, 'layout': l})

Output:

Sample Dataset Construction:

Below is more irrelevant code, which was used to construct the sample dataset.

import numpy as np
import pandas as pd
from plotly.offline import iplot
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler((0, 100))

np.random.seed(7)
vals1 = mms.fit_transform(np.random.randn(1000).reshape(-1, 1)).ravel()
np.random.seed(3)
vals2 = mms.fit_transform(np.random.randn(1000).reshape(-1, 1)).ravel()
np.random.seed(73)
vals3 = mms.fit_transform(np.random.randn(1000).reshape(-1, 1)).ravel()
outl1 = np.arange(150, 200, 10)
outl2 = np.arange(200, 250, 10)
outl3 = np.arange(250, 300, 10)

data1 = np.concatenate([vals1, outl1])
data2 = np.concatenate([vals2, outl2])
data3 = np.concatenate([vals3, outl3])

df = pd.DataFrame({'AZGD01': data1, 'AZPH01': data2, 'AZPV01': data3})