python pandas percentile iqr

How do I remove outliers from a pandas DataFrame?


I want to remove values that do not lie between -2σ and 2σ of the mean. How do I remove these outliers using the IQR?

I implemented this equation as follows.

iqr = df['abc'].percentile(0.75) - df['abc'].percentile(0.25)

cond1 = (df['abc'] > df['abc'].percentile(0.75) + 2 * iqr)
cond2 = (df['abc'] < df['abc'].percentile(0.25) - 2 * iqr)

df[cond1 & cond2]

Is this the right way?


Solution

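  • This is not right. The IQR (the distance between the 25th and 75th percentiles) is almost never equal to σ; quartiles and standard deviations measure spread in different ways. Note also that a pandas Series has no percentile() method (use Series.quantile() instead), and that cond1 and cond2 can never both be true for the same value, so df[cond1 & cond2] is always empty.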

    Fortunately, you can easily compute the standard deviation of a numerical Series using Series.std().

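    sigma = df['abc'].std()  # sample standard deviation (ddof=1 by default)
    
    # Keep only the rows strictly within two standard deviations of the mean
    cond1 = (df['abc'] > df['abc'].mean() - 2 * sigma)
    cond2 = (df['abc'] < df['abc'].mean() + 2 * sigma)
    
    df[cond1 & cond2]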
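
If you do want an IQR-based filter rather than a σ-based one, a minimal sketch might look like the following. It uses Series.quantile() for the quartiles and keeps the values between the two fences. The helper name remove_outliers_iqr, its parameter k, and the sample data are hypothetical and only for illustration; k = 1.5 is the conventional multiplier, and you can pass k=2 to match the multiplier in the question.

```python
import pandas as pd


def remove_outliers_iqr(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR].

    Hypothetical helper for illustration; `k` is the fence multiplier.
    """
    q1 = df[column].quantile(0.25)  # first quartile
    q3 = df[column].quantile(0.75)  # third quartile
    iqr = q3 - q1

    # Series.between is inclusive on both ends by default;
    # rows with NaN in `column` evaluate to False and are dropped.
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


# Made-up sample data with one obvious outlier (100)
df = pd.DataFrame({"abc": [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]})
filtered = remove_outliers_iqr(df, "abc")
print(filtered["abc"].tolist())
```

Note that the two conditions are combined so that a row must satisfy both bounds to be kept; selecting the outliers themselves would instead require "above the upper fence OR below the lower fence".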