My machine-learning data set has several variables. Below is a box plot of one of them (call it x) against the outcome (call it y). I want to remove the outliers, but only for x = 0, 1, 2, 3, 4, as there are no outliers for x = 5 and above.
I used the function below to try to remove the outliers using the interquartile range (IQR) method:
import pandas as pd
import numpy as np
# Load the dataset
df = pd.read_csv('.xxx.csv')
# Function to remove outliers
def remove_outlier_using_IQR(df: pd.DataFrame, name_column: str, value: int) -> pd.DataFrame:
    """
    Remove outliers using IQR in the 'name_column'.
    Args:
        df (pd.DataFrame): The DataFrame containing the columns for outlier removal.
        name_column (str): The name of the column containing outliers to be removed.
        Value: Value in name column for outlier removal.
    Returns:
        pd.DataFrame: The DataFrame with outliers in 'name_column' removed.
    """
    # Detect outliers in the 'name_column'
    df2 = df[df[name_column]==value]
    Q1 = df2['final_test'].quantile(0.25)
    Q3 = df2['final_test'].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    df2.info()
    print(df2.shape)
    print(Q3)
    # Create arrays of Boolean values indicating the outlier rows
    upper_array = np.where(df2['final_test'] >= upper)[0]
    lower_array = np.where(df2['final_test'] <= lower)[0]
    print(upper)
    print(upper_array)
    # Removing the outliers
    df2.drop(index=upper_array, inplace=True)
    df2.drop(index=lower_array, inplace=True)
    df3 = df[df[name_column]!=value]
    df_merged = pd.concat([df2,df3], ignore_index=False, sort=False)
    return df_merged
# Use function to remove outliers
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(0))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(1))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(2))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(3))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(4))
Running the code gives:
print(df2.shape) => (56,18)
print(Q3) => 48.0
print(upper) => 55.5
print(upper_array) => [15 25 34 53]
KeyError Traceback (most recent call last)
Cell In[36], line 46
43 return df_merged
45 # Use staticmethod fundction above to remove outliers identified from EDA boxplots
---> 46 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(0))
47 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(1))
48 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(2))
Cell In[36], line 37
34 print(upper_array)
36 # Removing the outliers
---> 37 df2.drop(index=upper_array, inplace=True)
38 df2.drop(index=lower_array, inplace=True)
40 df3 = df[df[name_column]!=value]
File c:\Users\xxx\AppData\Local\anaconda3\envs\hdbenv\Lib\site-packages\pandas\core\frame.py:5581, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
5433 def drop(
5434 self,
5435 labels: IndexLabel | None = None,
(...)
5442 errors: IgnoreRaise = "raise",
5443 ) -> DataFrame | None:
5444 """
5445 Drop specified labels from rows or columns.
...
-> 7070 raise KeyError(f"{labels[mask].tolist()} not found in axis")
7071 indexer = indexer[~mask]
7072 return self.delete(indexer)
KeyError: '[15, 25, 34, 53] not found in axis'
I want to use the code df2.drop(index=upper_array, inplace=True) to drop the samples with index [15, 25, 34, 53], as they are outliers. However, I get an error saying [15, 25, 34, 53] not found in axis.
You need to convert the positional indices to the corresponding labels in df2.index. You can do that by indexing df2.index with those arrays.
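So do this, looking up both sets of labels before anything is dropped (after the first drop, the positions in lower_array would no longer line up with df2.index):
outlier_labels = df2.index[np.concatenate([upper_array, lower_array])]
df2.drop(index=outlier_labels, inplace=True)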
That way, you are passing .drop() the actual row labels from df2.index rather than numeric positions.
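For reference, here is a minimal sketch of the question's function with the positions translated to labels before dropping. The `target_column` parameter, the `.copy()` call, and the toy data are my own assumptions for illustration and are not part of the original code; the IQR logic itself is unchanged.

```python
import numpy as np
import pandas as pd


def remove_outlier_using_iqr(df, name_column, value, target_column="final_test"):
    """Drop IQR outliers of `target_column` among rows where `name_column == value`."""
    # .copy() so that df2 is an independent frame, not a slice of df
    df2 = df[df[name_column] == value].copy()
    q1 = df2[target_column].quantile(0.25)
    q3 = df2[target_column].quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr

    # np.where returns positions within df2, not row labels
    upper_array = np.where(df2[target_column] >= upper)[0]
    lower_array = np.where(df2[target_column] <= lower)[0]

    # Translate positions to labels once, before any row is dropped
    outlier_labels = df2.index[np.concatenate([upper_array, lower_array])]
    df2 = df2.drop(index=outlier_labels)

    # Re-attach the rows that were never considered for removal
    df3 = df[df[name_column] != value]
    return pd.concat([df2, df3], ignore_index=False, sort=False)


# Made-up data: the hours_per_week == 0 rows carry labels 1, 3, 5, 7, 9, 11,
# and the row labelled 9 (final_test = 99) is the only outlier among them
toy = pd.DataFrame(
    {
        "hours_per_week": [5, 0, 5, 0, 5, 0, 5, 0, 5, 0, 5, 0],
        "final_test": [50, 40, 55, 42, 60, 41, 52, 43, 58, 99, 51, 44],
    }
)
cleaned = remove_outlier_using_iqr(toy, "hours_per_week", 0)
print(sorted(cleaned.index.tolist()))
```

In this toy frame the outlier sits at position 4 of df2 but has label 9, so dropping by position would fail in the same way as in the question.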
np.where(...) gives you positional indices (0, 1, 2, …) relative to df2 rather than the actual row labels in df2.
When you do
df2.drop(index=upper_array, inplace=True)
pandas interprets upper_array as actual row labels that must exist in df2.index. So the KeyError '[15, 25, 34, 53] not found in axis' simply means that df2 does not have rows labelled 15, 25, 34, and 53.
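To see the difference between positions and labels in isolation, here is a small reproduction. The column names `x` and `y` and the values are made up for illustration; only the mechanism matches the question.

```python
import numpy as np
import pandas as pd

# Made-up data; filtering keeps the original row labels
df = pd.DataFrame({"x": [1, 0, 1, 0, 1, 0], "y": [10, 20, 30, 40, 50, 99]})
df2 = df[df["x"] == 0]

print(df2.index.tolist())  # labels inherited from df, not 0, 1, 2

# Position of the matching row inside df2
positions = np.where(df2["y"] >= 99)[0]
print(positions.tolist())

# Dropping by position fails because that number is not a label of df2
raised = False
try:
    df2.drop(index=positions)
except KeyError as err:
    raised = True
    print("KeyError:", err)

# Dropping by the label found at that position works
labels = df2.index[positions]
fixed = df2.drop(index=labels)
print(fixed.index.tolist())
```

Calling df2.reset_index(drop=True) right after filtering would also make positions and labels coincide, but it discards the original labels, which the pd.concat step in the question relies on to keep rows distinguishable.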