I am building a binary classification model on a heavily imbalanced dataset (95% 1s and 5% 0s). I want to drop the rows that contain outliers, and I used the code below:
import numpy as np
from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
However, this code is dropping the rows that have my label 0. Is there a better way of dropping rows with outliers for all columns except the label column?
Try this (assuming your label is in df["label"]):
df = df[(df["label"] == 0) | (np.abs(stats.zscore(df)) < 3).all(axis=1)]
The first condition keeps all rows with df["label"] == 0, regardless of their z-scores.
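The reason the original filter drops the 0 rows is that the label column itself goes into the z-score: with 95% 1s and 5% 0s, the label has mean 0.95 and standard deviation of about 0.218, so every 0 gets a z-score of roughly -4.36 and fails the < 3 test. If you would rather filter every row on its features only and leave the label out of the calculation, as the question asks, something like the sketch below might work. The function name drop_outlier_rows and the parameters label_col and threshold are my own choices for illustration, not part of pandas or SciPy, and the sketch assumes all feature columns are numeric.

```python
import numpy as np
import pandas as pd
from scipy import stats


def drop_outlier_rows(df, label_col="label", threshold=3.0):
    """Drop rows where any feature has |z-score| >= threshold.

    Hypothetical helper: the label column is excluded from the
    z-score computation, so it can never cause a row to be dropped.
    """
    # Compute z-scores on the feature columns only (column-wise).
    features = df.drop(columns=[label_col]).to_numpy(dtype=float)
    z = np.abs(stats.zscore(features, axis=0))

    # Keep a row only if every feature is within the threshold.
    keep = (z < threshold).all(axis=1)
    return df.loc[keep]
```

Unlike the one-liner above, this still removes label-0 rows whose features are outliers, so pick whichever behavior you actually want. Also note that a constant feature column produces NaN z-scores, and NaN < 3 is False, so such columns would need to be dropped before calling it.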