python pandas outliers drop

Dropping rows with outliers from specific columns


I am building a binary classification model on a heavily imbalanced dataset (95% 1s and 5% 0s). I want to drop the rows that contain outliers, so I used the code below:

import numpy as np
from scipy import stats
# Keep only rows where every column's |z-score| is below 3
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

However, this code also drops the rows whose label is 0. Is there a better way to drop rows with outliers in all columns except the label column?


Solution

  • Try this (assuming your label is in df["label"]):

    df = df[(df["label"] == 0) | (np.abs(stats.zscore(df)) < 3).all(axis=1)]
    

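    The first condition keeps every row with df["label"] == 0 regardless of its z-scores. Those rows were being dropped because, with only 5% zeros, the value 0 in the label column has a z-score of about -4.4, which fails the |z| < 3 test. Note that this also keeps label-0 rows that have outliers in other columns.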
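
Since the question asks for outlier filtering on all columns except the label, another option is to compute the z-scores on the feature columns only and apply the resulting mask to the full frame. The sketch below is a minimal illustration, not the answerer's code: the helper name drop_outlier_rows and its label_col and threshold parameters are hypothetical, all feature columns are assumed to be numeric, and the z-score is computed directly in pandas with the population standard deviation (ddof=0), which is the default that stats.zscore uses.

```python
import numpy as np
import pandas as pd


def drop_outlier_rows(df, label_col="label", threshold=3.0):
    """Drop rows where any non-label column has |z-score| >= threshold.

    Hypothetical helper: the name and parameters are illustrative only.
    Assumes every column other than label_col is numeric.
    """
    # Exclude the label so that a rare class is never treated as an outlier.
    features = df.drop(columns=[label_col])

    # Column-wise z-score with the population standard deviation (ddof=0).
    z = (features - features.mean()) / features.std(ddof=0)

    # A row is kept only if all of its feature z-scores are within the threshold.
    mask = (np.abs(z) < threshold).all(axis=1)
    return df[mask]
```

Calling df = drop_outlier_rows(df) would then filter on the features alone, so label-0 rows are removed only when one of their feature values is an outlier. A constant feature column has zero standard deviation and therefore NaN z-scores, which fail the comparison for every row, so such columns should probably be dropped beforehand.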