I've been trying to remove outliers from my dataset using isolation forest, but I can't figure out how. I've seen the examples for credit card fraud and salary data, but I don't see how to apply them to each column, as my dataset consists of 3,862,900 rows and 19 columns. I've uploaded an image of the head of my dataset. How do I apply isolation forest to each column and then permanently remove the outliers it finds?
Thank you.
According to the docs, IsolationForest is used for detecting outliers, not removing them:
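import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
# fit expects a 2-D array, so reshape the single column to (n_samples, 1)
clf = IsolationForest().fit(df['temp'].values.reshape(-1, 1))
# predict returns 1 for inliers and -1 for outliers
clf.predict([[4], [5], [3636]])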
array([ 1, 1, -1])
As you can see from the output, 4 and 5 are not outliers but 3636 is.
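If you want to remove outliers from your dataframe, you can use a quantile filter based on the interquartile range (IQR). The version here keeps only the values strictly between the 25th and 75th percentiles: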
quant = df['temp'].quantile([0.25, 0.75])
df['temp'][~df['temp'].clip(*quant).isin(quant)]
4 6
5 7
7 8
8 9
9 10
As you can see, the outliers have been removed.
To filter the whole dataframe on a given column:
def IQR(df, colname, bounds=[.25, .75]):
    s = df[colname]
    q = s.quantile(bounds)
    # values clipped onto a bound lie on or outside it, so drop those rows
    return df[~s.clip(*q).isin(q)]
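Note that the filter above keeps only the middle half of each column, so applying it to 19 columns in turn would discard most rows. A more common convention is to keep everything within 1.5 × IQR of the quartiles (Tukey's fences). That is a different rule from the one above, shown here only as a rough sketch: the helper name `drop_outside_fences` and the parameter `k` are made up for illustration, and it assumes the listed columns are numeric.

```python
import pandas as pd

def drop_outside_fences(df, columns, k=1.5):
    """Hypothetical helper: keep rows inside [Q1 - k*IQR, Q3 + k*IQR] on every listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        spread = q3 - q1  # the interquartile range of this column
        # between() is inclusive on both ends by default
        keep &= df[col].between(q1 - k * spread, q3 + k * spread)
    return df[keep]

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
trimmed = drop_outside_fences(df, ['temp'])
print(trimmed['temp'].tolist())  # [1, 2, 3, 6, 7, 8, 9, 10, 11]
```

For the sample column the quartiles are 4.5 and 10.5, so the fences are -4.5 and 19.5 and only 345 and 5345 are dropped. To make the removal permanent, assign the result back, e.g. `df = drop_outside_fences(df, df.columns)`.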
Note: Isolation forest does not itself remove outliers from your dataset; it is used to detect outliers in new data.
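That said, if you still want to drop whatever the model flags, the labels it returns can be used as a boolean mask, and pandas does the actual removal. A minimal sketch, assuming the columns you pass are numeric with no missing values; the helper name `drop_flagged_rows` and the fixed `random_state` are my own choices, not something from the docs example above.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def drop_flagged_rows(df, columns, random_state=0):
    """Hypothetical helper: fit on `columns` and keep only rows labelled as inliers."""
    clf = IsolationForest(random_state=random_state)
    # fit_predict returns 1 for inliers and -1 for outliers, one label per row
    labels = clf.fit_predict(df[columns])
    return df[labels == 1]

df = pd.DataFrame({'temp': [1, 2, 3, 345, 6, 7, 5345, 8, 9, 10, 11]})
cleaned = drop_flagged_rows(df, ['temp'])
print(cleaned)
```

Passing several columns fits one model on all of them together, so a row is judged as a whole rather than column by column. How many rows get flagged depends on the `contamination` setting (left at its default here), so check the result before overwriting your original dataframe.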