Tags: python, pandas, numpy, outliers

Want to remove outliers using df.drop(index=array) but get the error "[...] not found in axis"


My machine learning data has several variables. Below is a box plot of one of the variables (call this x) against the outcome (call this y). I want to remove the outliers in y, but only for the groups x = 0, 1, 2, 3, 4, as there are no outliers for x = 5 and above.

[Figure: box plot of variable x against outcome y]

I used the function below to try to remove the outliers using the interquartile range (IQR) method:

import pandas as pd 
import numpy as np

# Load the dataset
df = pd.read_csv('.xxx.csv')

# Function to remove outliers
def remove_outlier_using_IQR(df: pd.DataFrame, name_column: str, value: int) -> pd.DataFrame:
    """
    Remove outliers using IQR in the 'name_column'.

    Args:
    df (pd.DataFrame): The DataFrame containing the columns for outlier removal.
    name_column (str): The name of the column containing outliers to be removed.
    value (int): The value of 'name_column' whose rows are checked for outliers.

    Returns:
    pd.DataFrame: The DataFrame with outliers in 'name_column' removed.
    """

    # Detect outliers in 'final_test' among the rows where 'name_column' equals value
    df2 = df[df[name_column]==value]
    Q1 = df2['final_test'].quantile(0.25)
    Q3 = df2['final_test'].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    df2.info()
    print(df2.shape)
    print(Q3)

    # Create arrays of the integer positions of the outlier rows within df2
    upper_array = np.where(df2['final_test'] >= upper)[0]
    lower_array = np.where(df2['final_test'] <= lower)[0]

    print(upper)
    print(upper_array)

    # Removing the outliers
    df2.drop(index=upper_array, inplace=True)
    df2.drop(index=lower_array, inplace=True)

    df3 = df[df[name_column]!=value]
    df_merged = pd.concat([df2,df3], ignore_index=False, sort=False)

    return df_merged

# Use function to remove outliers
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(0))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(1))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(2))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(3))
df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(4))

Running the code gives:

print(df2.shape) => (56,18)
print(Q3) => 48.0
print(upper) => 55.5
print(upper_array) => [15 25 34 53]

KeyError                                  Traceback (most recent call last)
Cell In[36], line 46
     43     return df_merged
     45 # Use staticmethod fundction above to remove outliers identified from EDA boxplots
---> 46 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(0))
     47 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(1))
     48 df = remove_outlier_using_IQR(df=df, name_column='hours_per_week', value=int(2))

Cell In[36], line 37
     34 print(upper_array)
     36 # Removing the outliers
---> 37 df2.drop(index=upper_array, inplace=True)
     38 df2.drop(index=lower_array, inplace=True)
     40 df3 = df[df[name_column]!=value]

File c:\Users\xxx\AppData\Local\anaconda3\envs\hdbenv\Lib\site-packages\pandas\core\frame.py:5581, in DataFrame.drop(self, labels, axis, index, columns, level, inplace, errors)
   5433 def drop(
   5434     self,
   5435     labels: IndexLabel | None = None,
   (...)
   5442     errors: IgnoreRaise = "raise",
   5443 ) -> DataFrame | None:
   5444     """
   5445     Drop specified labels from rows or columns.
...
-> 7070         raise KeyError(f"{labels[mask].tolist()} not found in axis")
   7071     indexer = indexer[~mask]
   7072 return self.delete(indexer)

KeyError: '[15, 25, 34, 53] not found in axis'

I want to use df2.drop(index=upper_array, inplace=True) to drop the samples at index [15, 25, 34, 53], as they are outliers. However, I get an error saying that [15, 25, 34, 53] is not found in axis.


Solution

  • You need to convert the positional indices to the corresponding labels in df2.index. You can do that by indexing df2.index with those arrays:

    df2.drop(df2.index[upper_array], inplace=True)
    df2.drop(df2.index[lower_array], inplace=True)
    

    That way, you pass .drop() the actual row labels from df2.index rather than numeric positions.

    This is needed because np.where(...) returns positional indices (0, 1, 2, …) relative to df2, not the row labels of df2. When you do

    df2.drop(index=upper_array, inplace=True)
    

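    pandas interprets upper_array as row labels that must exist in df2.index. Because df2 is a filtered subset of df, it keeps the row labels of the original frame, so its labels generally do not run 0, 1, 2, …. The error KeyError: '[15, 25, 34, 53] not found in axis' simply means that df2 has no rows labelled 15, 25, 34 and 53.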
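The difference between positions and labels can be reproduced on a small frame. The sketch below uses made-up data, not the asker's CSV: the index values and the final_test numbers are assumptions chosen so that the row labels do not start at 0, as happens after filtering. It shows the failing call, the fix from this answer, and a boolean-mask alternative (not part of the original answer) that avoids np.where altogether.

```python
import numpy as np
import pandas as pd

# Made-up data (an assumption for illustration). The row labels mimic a
# filtered subset, which keeps the labels of the frame it was taken from.
df2 = pd.DataFrame(
    {"final_test": [50, 52, 48, 51, 49, 95]},
    index=[3, 7, 12, 20, 31, 44],
)

# IQR fences, computed the same way as in the question.
q1 = df2["final_test"].quantile(0.25)
q3 = df2["final_test"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# np.where returns integer positions within df2, not row labels.
upper_pos = np.where(df2["final_test"] >= upper)[0]

# Passing positions where labels are expected fails, because the
# position of the outlier row is not one of the labels in df2.index.
try:
    df2.drop(index=upper_pos)
    raised = False
except KeyError:
    raised = True

# Fix: translate positions to labels first, then drop by label.
upper_labels = df2.index[upper_pos]
by_labels = df2.drop(index=upper_labels)

# Alternative: build a boolean mask and keep the rows strictly inside
# the fences, so no position-to-label conversion is needed.
inside = (df2["final_test"] > lower) & (df2["final_test"] < upper)
by_mask = df2[inside]

print(raised)
print(by_labels.index.tolist())
print(by_mask.index.tolist())
```

Both approaches keep the same rows here. The mask version also handles the upper and lower fences in a single step, which can be convenient inside a function like remove_outlier_using_IQR.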