python pandas csv jupyter-notebook random-forest

Outlier removal Isolation Forest

I've been trying to remove outliers from my dataset using Isolation Forest, but I can't figure out how. I've seen the examples for credit card fraud and salary data, but I don't see how to apply them to each column; my dataset has 3,862,900 rows and 19 columns. I've uploaded an image of the head of the data. How do I apply Isolation Forest to each column and then permanently remove the outliers?

Thank you.

Solution

According to the docs, IsolationForest is used for detecting outliers, not removing them:

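import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1))  # fit expects a 2-D array
clf.predict([[4], [5], [3636]])  # 1 = inlier, -1 = outlier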

array([ 1, 1, -1])

As you can see from the output, 4 and 5 are not outliers (label 1), but 3636 is (label -1).
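Since predict only returns labels, one way to act on them is to use the labels as a boolean mask on the rows the model was fitted on. This is a minimal sketch, not something taken from the docs example above: the fixed random_state is only there to make the run repeatable, and the default contamination setting decides how many points get flagged, so borderline values may be labelled differently on other data.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})

# random_state is an assumption, fixed only so the sketch is repeatable
clf = IsolationForest(random_state=0)

# fit_predict fits on the column and labels each row: 1 = inlier, -1 = outlier
labels = clf.fit_predict(df[['temp']])

# keep only the rows labelled as inliers
cleaned = df[labels == 1]
```

Passing several columns at once (for example df[['a', 'b']], hypothetical names) would flag rows that look unusual across those columns together, which is not the same as checking each column separately.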

If you want to remove outliers from your DataFrame, you can filter on the interquartile range (IQR) instead, keeping only the values between the first and third quartiles:

quant = df['temp'].quantile([0.25, 0.75])
df['temp'][~df['temp'].clip(*quant).isin(quant)]

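The outliers 345 and 5345 have been removed. Note that this filter keeps only the values strictly between the 25th and 75th percentiles (here 6, 7, 8, 9 and 10), so it also drops the lowest and highest quarter of the ordinary values.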

To filter the whole DataFrame by one column:

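def IQR(df, colname, bounds=(.25, .75)):
    s = df[colname]
    # lower and upper quantile of the column
    q = s.quantile(list(bounds))
    # clipped values equal to a bound lie on or outside it; keep the rest
    return df[~s.clip(*q).isin(q)]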
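The function above keeps only the middle of the distribution for a single column. A common, less aggressive variant is the 1.5 × IQR rule (Tukey's fences), which drops only values far outside the quartiles. The sketch below is a different technique from the function above, applied to every numeric column at once; the helper name drop_iqr_outliers and the parameter k are made up for illustration.

```python
import pandas as pd

def drop_iqr_outliers(df, columns=None, k=1.5):
    """Hypothetical helper: keep rows within [Q1 - k*IQR, Q3 + k*IQR] for every column."""
    if columns is None:
        # default to all numeric columns
        columns = df.select_dtypes(include='number').columns
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # True where every selected column lies within its fences
    # (rows containing NaN compare as False and are dropped as well)
    keep = ((df[columns] >= lower) & (df[columns] <= upper)).all(axis=1)
    return df[keep]

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
cleaned = drop_iqr_outliers(df)
print(cleaned['temp'].tolist())  # [1, 2, 3, 6, 7, 8, 9, 10, 11]
```

Reassigning the result (df = drop_iqr_outliers(df)) makes the removal permanent in the notebook. A row is dropped as soon as one of its columns falls outside the fences, so with 19 columns it is worth checking how many rows survive.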

Note: IsolationForest does not remove anything from your dataset by itself. It only labels points as inliers or outliers, and it is typically used to detect outliers in new data.