Tags: python, pandas, dataframe, normal-distribution, outliers

Remove outliers from the target column when an independent variable column has a specific value


I have a dataframe that looks as follows (click on the link below):

df.head(10)

https://ibb.co/vqmrkXb

What I would like to do is remove outliers from the target column (occupied_parking_spaces) when the value of the day column equals 6, for instance, which refers to Sunday (df['day'] == 6), using the 68-95-99.7 rule of the normal distribution.

I tried the following code:

df = df.mask((df['occupied_parking_spaces'] - df['occupied_parking_spaces'].mean()).abs() > 2 * df['occupied_parking_spaces'].std()).dropna()

This line of code removes outliers from the whole dataset regardless of the independent variables, but I only want to remove outliers from the occupied_parking_spaces column where the day value equals 6, for example.

What I can do is create a separate dataframe from which I remove the outliers:

sunday_df = df.loc[df['day'] == 6]

sunday_df = sunday_df.mask((sunday_df['occupied_parking_spaces'] - sunday_df['occupied_parking_spaces'].mean()).abs() > 2 * sunday_df['occupied_parking_spaces'].std()).dropna()

But by doing this I will end up with a separate dataframe for every day of the week, which I will have to concatenate at the end. I would like to avoid that, as there must be a way to do this inside the same dataframe.

Could you please help me out?


Solution

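  • Having defined some function to flag outliers (for example, one that returns the column at its original length with outliers replaced by NaN), you could use np.where to apply it selectively. Assign the result back to the column rather than to df, since np.where returns a plain NumPy array:

    import numpy as np
    # Assumes remove_outliers returns a same-length Series with outliers set to NaN.
    df['occupied_parking_spaces'] = np.where(
        df['day'] == 6,
        remove_outliers(df['occupied_parking_spaces']),
        df['occupied_parking_spaces'],
    )
    df = df.dropna(subset=['occupied_parking_spaces'])  # drop the flagged rows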
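
One caveat with the np.where call above: remove_outliers receives the whole column, so unless it is written otherwise, its mean and standard deviation are computed over all days rather than over Sundays only. As an alternative, here is a minimal sketch that builds a boolean mask inside the same dataframe, using statistics from the selected day only. The helper names (drop_outliers_for_day, drop_outliers_per_day), the n_std parameter, and the example data are made up for illustration; only the column names day and occupied_parking_spaces come from the question.

```python
import pandas as pd

COL = "occupied_parking_spaces"


def drop_outliers_for_day(df, day, col=COL, n_std=2):
    """Drop rows for one `day` whose `col` value is more than
    n_std standard deviations from that day's own mean."""
    is_day = df["day"] == day
    day_values = df.loc[is_day, col]
    mean, std = day_values.mean(), day_values.std()
    # Only rows of the selected day can be flagged; other days are untouched.
    is_outlier = is_day & ((df[col] - mean).abs() > n_std * std)
    return df[~is_outlier]


def drop_outliers_per_day(df, col=COL, n_std=2):
    """Same rule applied to every day at once, each day using its own
    mean and standard deviation (no splitting and concatenating)."""
    grouped = df.groupby("day")[col]
    deviation = (df[col] - grouped.transform("mean")).abs()
    is_outlier = deviation > n_std * grouped.transform("std")
    return df[~is_outlier]


# Made-up example data: day 6 (Sunday) and day 0 each contain one extreme value.
example = pd.DataFrame({
    "day": [6] * 10 + [0] * 10,
    COL: [10, 11, 9, 10, 12, 10, 11, 9, 10, 200,
          20, 21, 19, 20, 22, 20, 21, 19, 20, 500],
})

sunday_cleaned = drop_outliers_for_day(example, day=6)
all_days_cleaned = drop_outliers_per_day(example)
```

With this data, drop_outliers_for_day(example, day=6) removes only the extreme Sunday row and leaves the other day as it is, while drop_outliers_per_day applies the same rule to each day separately. Changing n_std to 1 or 3 selects the other bands of the 68-95-99.7 rule.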