I am new to ML. I am removing outliers using z-scores with the code given below. The problem I am facing is that after I remove the outliers, some values are still flagged as outliers. Can anyone explain why this happens? Isn't the z-score a reliable method for removing all the outliers from the data?
I am calculating the z-scores a second time to check whether any outliers are still left.
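import numpy as np
import pandas as pd
from scipy import stats

# housing (DataFrame) and numerical_features (list of column names) are defined earlier
for feature in numerical_features:
    data = pd.DataFrame(housing[feature], columns=[feature])
    data = data.copy()
    # First pass: flag points more than 3 standard deviations from the mean
    z_scores = np.abs(stats.zscore(data[feature]))
    print("Before z score on ", feature, " ====> ", data[z_scores > 3].shape)
    # Replace the flagged points with the column median
    data[z_scores > 3] = data[feature].median()
    # Second pass: recompute the z-scores on the modified column
    z_scores = np.abs(stats.zscore(data[feature]))
    print("After z score on ", feature, " ===> ", data[z_scores > 3].shape)
    housing[feature] = data[feature]
    print()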
"Before z score" is the first time I apply the z-score, and it tells me how many values will be affected when I replace them with the median. "After z score" tells me how many values are still flagged as outliers. https://i.sstatic.net/g6kn2.png
The z-score tells you how many standard deviations away from the mean a certain point is. Using |z-score| > 3 is a very common way to identify outliers. What you are missing is that when you remove or replace outliers, the mean and standard deviation of your new distribution are different from what they used to be, so the z-scores of all remaining points change as well. Because the extreme values are gone, the standard deviation shrinks, and points that were within 3 standard deviations before can now fall outside that range. In many cases the change is negligible; however, there are some cases where the change in z-score is more pronounced.
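To make the effect concrete, here is a minimal sketch on made-up data (the array `values` and the helper `abs_zscores` are illustrative and not taken from the question). It uses the population standard deviation, which is also the default of `scipy.stats.zscore` (`ddof=0`).

```python
import numpy as np

def abs_zscores(values):
    """Absolute z-scores using the population standard deviation (ddof=0)."""
    values = np.asarray(values, dtype=float)
    return np.abs((values - values.mean()) / values.std())

# Made-up data: 48 values of -1 or 1, one moderate value (10), one extreme value (100)
values = np.array([-1.0, 1.0] * 24 + [10.0, 100.0])

# First pass: the extreme value inflates the standard deviation, so only 100 is flagged
before = abs_zscores(values) > 3
print("flagged on first pass:", values[before])

# Replace the flagged points with the median, as in the question
replaced = values.copy()
replaced[before] = np.median(values)

# Second pass: the standard deviation is now much smaller, so 10 is flagged too
after = abs_zscores(replaced) > 3
print("std before:", values.std(), "std after:", replaced.std())
print("flagged on second pass:", replaced[after])
```

The value 10 is well inside 3 standard deviations while 100 is present, and well outside once 100 has been replaced, which is the same pattern as the non-zero "After" counts in the question.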
Depending on your application, you may wish to run the z-score filter a couple of times until you get a stable distribution, meaning a pass that flags no new points. Also, depending on your application, you may consider dropping outlier data instead of replacing them with the median. Hopefully you know why you chose to replace and the caveats associated with that choice.
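A repeated filter could look something like the sketch below. The function name `drop_outliers_until_stable`, the `max_passes` cap, and the sample data are assumptions made for illustration; this version drops flagged points rather than replacing them.

```python
import numpy as np

def drop_outliers_until_stable(values, threshold=3.0, max_passes=10):
    """Repeatedly drop points with |z| > threshold until a pass flags nothing.

    Returns the filtered array and the number of passes that removed points.
    """
    values = np.asarray(values, dtype=float)
    passes = 0
    while passes < max_passes:
        std = values.std()
        if std == 0:
            # All remaining values are identical, so z-scores are undefined
            break
        z = np.abs((values - values.mean()) / std)
        keep = z <= threshold
        if keep.all():
            # Nothing flagged on this pass: the distribution is stable
            break
        values = values[keep]
        passes += 1
    return values, passes

# Made-up data: 48 values of -1 or 1, plus 10 and 100
sample = np.array([-1.0, 1.0] * 24 + [10.0, 100.0])
cleaned, passes = drop_outliers_until_stable(sample)
print(len(sample), "->", len(cleaned), "values after", passes, "passes")
```

The cap on the number of passes is a safety measure: on heavy-tailed data each pass can keep trimming more points, so it is worth checking how much data is left before trusting the result.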