python data-science outliers

What if there are a lot of outliers in your dataset?


I was trying to deal with outliers in my dataset, but when I counted them I found that approximately 95% of the values in every column are flagged as outliers, which is very strange.

So is it a good idea to cap these values at the IQR limits, or should I keep them as they are?

def check_outliers(col) :
    outliers = []
    Q1 = col.quantile(.25)
    Q3 = col.quantile(.75)
    IQR = Q3 - Q1
    lowerLimit = Q1 - 1.5*IQR
    higherLimit = Q3 - 1.5*IQR
    
    for elt in col :
        if elt < lowerLimit or elt > higherLimit :
            outliers.append(elt)
            
    return np.array(outliers), lowerLimit, higherLimit


for col in train.columns :
    arr,lowerLimit,higherLimit = check_outliers(train[col])
    print(col, len(arr))
    
    train[col] = np.where(train[col]>higherLimit,higherLimit,train[col])
    train[col] = np.where(train[col] <lowerLimit,lowerLimit,train[col])

I thought those values might be the result of human error or system failures, so we cannot simply accept them. We cannot drop them either, since dropping those rows would also lose the data in the other features.

So I thought, why not use the IQR?

However, after applying it, my model's predictions were perfect, which suggests that something is wrong!


Solution

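  • For higherLimit you've written Q3 - 1.5*IQR, but it should be Q3 + 1.5*IQR. As written, the upper bound works out to Q1 - 0.5*IQR, which lies below Q1, so roughly 75% or more of each column is flagged as an outlier. That explains the 95% figure, which shouldn't be possible with correctly computed limits, since about half of the data always lies between Q1 and Q3.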

    import numpy as np

    def check_outliers(col):
        outliers = []
        Q1 = col.quantile(.25)
        Q3 = col.quantile(.75)
        IQR = Q3 - Q1
        lowerLimit = Q1 - 1.5*IQR
        higherLimit = Q3 + 1.5*IQR  # upper limit: add, not subtract

        # collect every value that falls outside the two limits
        for elt in col:
            if elt < lowerLimit or elt > higherLimit:
                outliers.append(elt)

        return np.array(outliers), lowerLimit, higherLimit
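
If it helps, the same check and the capping step from the question can be sketched without an explicit Python loop. This is only a sketch under a few assumptions: the helper names `iqr_limits` and `cap_outliers` and the toy `demo` frame are made up for illustration (they stand in for `train`), only numeric columns are handled, and `Series.clip` is used in place of the two `np.where` calls.

```python
import pandas as pd


def iqr_limits(col, k=1.5):
    """Return the (lower, upper) IQR limits for a numeric column."""
    q1 = col.quantile(0.25)
    q3 = col.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(df, k=1.5):
    """Return a copy of df with each numeric column clipped to its IQR limits."""
    capped = df.copy()
    for name in capped.select_dtypes(include="number").columns:
        lower, upper = iqr_limits(capped[name], k)
        # Boolean mask of values outside the limits; its sum is the outlier count.
        n_out = int(((capped[name] < lower) | (capped[name] > upper)).sum())
        print(name, n_out)
        # Values below `lower` become `lower`, values above `upper` become `upper`.
        capped[name] = capped[name].clip(lower=lower, upper=upper)
    return capped


# Hypothetical toy data standing in for `train`.
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]})
capped = cap_outliers(demo)
```

Working on a copy keeps the original frame intact, so the counts before and after capping can still be compared.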