python pandas nan data-cleaning

Fixing IndexingError to clean the data


I'm trying to identify outliers in each housing type category, but I keep running into an issue. Whenever I run the code, I receive the following error: "IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match)."

grouped = df.groupby('Type')
q1 = grouped["price"].quantile(0.25)
q3 = grouped["price"].quantile(0.75)
iqr = q3 - q1

upper_bound = q3 + (1.5 * iqr)
lower_bound = q1 - (1.5 * iqr)

outliers = df[(df["price"].reset_index(drop=True) > upper_bound[df["Type"]].reset_index(drop=True)) | (df["price"].reset_index(drop=True) < lower_bound[df["Type"].reset_index(drop=True)])]
print(outliers)

When I run just this part of the code:

(df["price"].reset_index(drop=True) > upper_bound[df["Type"]].reset_index(drop=True)).reset_index(drop = True)

I get a boolean Series, but when I put it inside df[] it breaks.


Solution

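  • Use transform to compute q1/q3. Unlike grouped["price"].quantile(), which returns one value per Type, transform returns one value per row and keeps df's original index, so the resulting boolean mask aligns with df: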

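    grouped = df.groupby("Type")

    # transform broadcasts each group's quantile back to every row of that group
    q1 = grouped["price"].transform(lambda x: x.quantile(0.25))
    q3 = grouped["price"].transform(lambda x: x.quantile(0.75))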
    
    iqr = q3 - q1
    
    upper_bound = q3 + (1.5 * iqr)
    lower_bound = q1 - (1.5 * iqr)
    
    outliers = df[df["price"].gt(upper_bound) | df["price"].lt(lower_bound)]
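To see the approach end to end, here is a minimal sketch on made-up data. The sample values and the non-default index (10..19) are assumptions for illustration only; the index is chosen to mimic a frame that was filtered earlier, which is the kind of situation where a mask rebuilt with reset_index(drop=True) would no longer line up with df.

```python
import pandas as pd

# Hypothetical sample data (not from the question). The index 10..19 stands in
# for a DataFrame whose labels are no longer 0..n-1.
df = pd.DataFrame(
    {
        "Type": ["house"] * 5 + ["flat"] * 5,
        "price": [100, 110, 120, 130, 1000, 50, 55, 60, 65, 5],
    },
    index=range(10, 20),
)

grouped = df.groupby("Type")

# One quantile value per row, indexed exactly like df.
q1 = grouped["price"].transform(lambda x: x.quantile(0.25))
q3 = grouped["price"].transform(lambda x: x.quantile(0.75))
iqr = q3 - q1

upper_bound = q3 + 1.5 * iqr
lower_bound = q1 - 1.5 * iqr

# The mask shares df's index, so boolean indexing works without any reset_index.
mask = df["price"].gt(upper_bound) | df["price"].lt(lower_bound)
outliers = df[mask]
print(outliers)
```

With this sample, the rows flagged are the 1000 in the "house" group and the 5 in the "flat" group. Because every intermediate Series keeps df's index, no reset_index calls are needed at any step.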