How to find outliers in a data set and plot them using the z-score


Data set is below

storeid,revenue,profit
101,779183,281257
101,144829,838451
101,766465,757565
101,353297,261071
101,1615461,275760
102,246731,949229
102,951518,301016
102,444669,430583

Code is below

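import pandas as pd

# dummies is the DataFrame holding the data set above.
# .copy() avoids a SettingWithCopyWarning when the z-score columns are added.
dummies1 = dummies[['storeid', 'revenue', 'profit']].copy()
cols = list(dummies1.columns)
cols.remove('storeid')

# Compute the z-score of each remaining column (population std, ddof=0)
for col in cols:
    col_zscore = col + '_zscore'
    dummies1[col_zscore] = (dummies1[col] - dummies1[col].mean()) / dummies1[col].std(ddof=0)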

Here I need a scatter plot and a box plot with the outliers marked. How can I do that?

How do I find the outliers?

Let's say the threshold is 3, so any point with np.abs(z_score) > threshold is considered an outlier.


Solution

  • Slicing the data based on the z-score will give you the data to plot. If you just want to find where one variable is an outlier, you can do (for example):

    import numpy as np

    THRESHOLD = 1.5  # no |z-score| exceeds 3 in your example

    to_plot = dummies1[np.abs(dummies1['revenue_zscore']) > THRESHOLD]
    

    Or if either column can be an outlier, you can do:

    to_plot = dummies1[(np.abs(dummies1['revenue_zscore']) > THRESHOLD) | 
                       (np.abs(dummies1['profit_zscore']) > THRESHOLD)]
    

    You weren't very specific about the plot, but here's an example that uses this selection (with ~ negating the outlier condition to select the normal points):

    import matplotlib.pyplot as plt

    # Boolean mask: True where either column's |z-score| exceeds the threshold
    is_outlier = ((np.abs(dummies1['revenue_zscore']) > THRESHOLD) |
                  (np.abs(dummies1['profit_zscore']) > THRESHOLD))
    non_outliers = dummies1[~is_outlier]
    outliers = dummies1[is_outlier]

    fig, ax = plt.subplots(figsize=(7, 5))
    ax.scatter(non_outliers['revenue'], non_outliers['profit'])
    ax.scatter(outliers['revenue'], outliers['profit'], color='red', marker='x')
    ax.set_ylabel('Profit')
    ax.set_xlabel('Revenue')
    plt.show()
    

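For the box plot you also asked about, here is a minimal, self-contained sketch. It assumes the sample rows from the question (with the header written as storeid to match your code) and the same illustrative THRESHOLD of 1.5. The helper names add_zscores and boxplot_with_outliers are hypothetical, made up for this example. Matplotlib's built-in box plot fliers follow a 1.5 × IQR rule rather than z-scores, so the sketch hides them with showfliers=False and overlays the z-score outliers as red crosses instead.

```python
import io

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Sample rows copied from the question; header normalised to 'storeid' (assumption).
CSV = """storeid,revenue,profit
101,779183,281257
101,144829,838451
101,766465,757565
101,353297,261071
101,1615461,275760
102,246731,949229
102,951518,301016
102,444669,430583
"""

THRESHOLD = 1.5  # illustrative cut-off; a threshold of 3 flags nothing in this sample


def add_zscores(df, cols):
    """Return a copy of df with a '<col>_zscore' column for each name in cols."""
    out = df.copy()
    for col in cols:
        out[col + "_zscore"] = (out[col] - out[col].mean()) / out[col].std(ddof=0)
    return out


def boxplot_with_outliers(df, cols, threshold):
    """Draw one box plot per column and mark its z-score outliers in red."""
    fig, axes = plt.subplots(1, len(cols), figsize=(4 * len(cols), 5), squeeze=False)
    for ax, col in zip(axes[0], cols):
        # Hide matplotlib's IQR-based fliers so only z-score outliers are marked.
        ax.boxplot(df[col].to_numpy(), showfliers=False)
        flagged = df.loc[df[col + "_zscore"].abs() > threshold, col]
        # A single box is drawn at x position 1, so the markers go there too.
        ax.scatter(np.ones(len(flagged)), flagged.to_numpy(),
                   color="red", marker="x", zorder=3)
        ax.set_title(col)
    return fig


df = add_zscores(pd.read_csv(io.StringIO(CSV)), ["revenue", "profit"])
fig = boxplot_with_outliers(df, ["revenue", "profit"], THRESHOLD)
```

Call plt.show() (or fig.savefig with a path of your choice) afterwards to see the figure. Each column is flagged on its own z-score here, which seemed the natural fit for per-column box plots; if you would rather mark a row whenever either column is an outlier, build the combined mask as in the scatter example and use it for both axes.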