python pandas csv jupyter-notebook random-forest

Outlier removal Isolation Forest


I've been trying to remove outliers from my database using Isolation Forest, but I can't figure out how. I've seen the examples for credit card fraud and salary data, but I can't work out how to apply them to each column, as my database consists of 3,862,900 rows and 19 columns. I've uploaded an image of the head of my database. How can I apply Isolation Forest to each column and then permanently remove the outliers it finds?

Thank you.



Solution

  • According to the docs, IsolationForest is used for detecting outliers, not removing them:

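    import pandas as pd
    from sklearn.ensemble import IsolationForest
    df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
    clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1))
    # predict returns 1 for inliers and -1 for outliers
    clf.predict([[4], [5], [3636]])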
    

    array([ 1, 1, -1])

    As you can see from the output, 4 and 5 are not outliers (label 1), but 3636 is (label -1).

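    If you want to remove outliers from your dataframe, you should use the interquartile range (IQR). The snippet below keeps only the values that lie strictly between the 25th and 75th percentiles: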

    quant = df['temp'].quantile([0.25, 0.75])
    df['temp'][~df['temp'].clip(*quant).isin(quant)]
    
    4     6
    5     7
    7     8
    8     9
    9    10
    

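    As you can see, the outliers have been removed. Note that this filter also drops the ordinary values outside the 25th to 75th percentile band (here 1, 2, 3 and 11).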
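
Because the quantile filter above keeps only the middle half of the data, a less aggressive variant may be worth considering. The sketch below uses the common 1.5 × IQR fence rule (Tukey's rule), which is a different technique from the snippet above, not a restatement of it. The function name `remove_outliers_tukey` and the multiplier `k` are illustrative choices, not part of the original answer.

```python
import pandas as pd

def remove_outliers_tukey(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends by default
    return s[s.between(q1 - k * iqr, q3 + k * iqr)]

temp = pd.Series([1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11], name='temp')
cleaned = remove_outliers_tukey(temp)
print(cleaned.tolist())  # [1, 2, 3, 6, 7, 8, 9, 10, 11]
```

On this sample the fences are -4.5 and 19.5, so only 345 and 5345 are dropped.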

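    For the whole df, the same filter can be wrapped in a function that takes a column name and returns the filtered rows: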

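    def IQR(df, colname, bounds=[.25, .75]):
        """Return the rows of df whose colname value lies strictly between the two quantiles."""
        s = df[colname]
        q = s.quantile(bounds)
        # Values clipped to either bound were outside the band, so drop those rows
        return df[~s.clip(*q).isin(q)]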
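
To cover every column of a wide frame, one option is to build a boolean mask per column and keep only the rows that pass in all of them. This is a minimal sketch under assumptions: the two-column frame and the column name `load` are made up to stand in for the real 19-column data, and `iqr_mask` is a hypothetical helper that applies the same strict quantile test as above. Computing all masks on the original data avoids the quantiles shifting as rows are removed column by column.

```python
import pandas as pd

def iqr_mask(s, bounds=(0.25, 0.75)):
    """True where a value lies strictly between the two quantiles of its column."""
    lo, hi = s.quantile(bounds[0]), s.quantile(bounds[1])
    return (s > lo) & (s < hi)

# Hypothetical two-column frame standing in for the real data
df = pd.DataFrame({
    'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11],
    'load': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 900],
})

numeric_cols = df.select_dtypes(include='number').columns
# One boolean column per feature, then keep rows that pass in every column
keep = df[numeric_cols].apply(iqr_mask).all(axis=1)
cleaned = df[keep]
print(cleaned.index.tolist())  # [4, 5, 7]
```

With many columns, requiring every value to sit in the middle 50% removes most rows, so wider bounds (for example 0.01 and 0.99) may be more practical.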
    

    Note: Isolation forest does not remove outliers from your dataset by itself. It only labels points as inliers or outliers, and it is typically used to detect outliers in new data.
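
Since the model only returns labels, any removal has to happen by filtering the DataFrame on those labels. The following is a minimal sketch of that idea, not the approach recommended in the answer above. It assumes scikit-learn's `IsolationForest.fit_predict`, which returns 1 for inliers and -1 for outliers. Which rows are flagged depends on the `contamination` setting and the random seed, so no particular output is claimed here.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})

# random_state is fixed only to make this sketch repeatable
clf = IsolationForest(random_state=0)
# Fit on the column(s) of interest; the result has one label per row
labels = clf.fit_predict(df[['temp']])

# Keep only the rows labelled as inliers
cleaned = df[labels == 1]
print(cleaned)
```

Passing several columns at once (for example `df[numeric_cols]`) scores each row on all features together rather than column by column.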