I have a dataframe that looks as follows (click on the link below):
df.head(10)
What I would like to do is remove outliers from the target column (occupied_parking_spaces) when the value of the day column is equal to 6, for instance, which refers to Sunday (df['day'] == 6), using the 68-95-99.7 rule of the normal distribution.
I tried the following code :
df = df.mask((df['occupied_parking_spaces'] - df['occupied_parking_spaces'].mean()).abs() > 2 * df['occupied_parking_spaces'].std()).dropna()
This line of code removes outliers from the whole dataset regardless of the independent variables, but I only want to remove outliers from the occupied_parking_spaces column where the day value is equal to 6, for example.
What I can do is create a separate dataframe from which I remove the outliers:
sunday_df = df.loc[df['day'] == 6]
sunday_df = sunday_df.mask((sunday_df['occupied_parking_spaces'] - sunday_df['occupied_parking_spaces'].mean()).abs() > 2 * sunday_df['occupied_parking_spaces'].std()).dropna()
But by doing this I will get multiple dataframes, one for every day of the week, that I will have to concatenate at the end. This is something I do not want to do, as there must be a way to do this inside the same dataframe.
Could you please help me out?
Having defined some function to remove outliers, you could use np.where
to apply it selectively:
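import numpy as np
# remove_outliers is a function you define yourself; it must return a Series of the
# same length as its input (for example with outliers replaced by NaN).
# Assign the result back to the column, not to df, so the dataframe is not overwritten.
df['occupied_parking_spaces'] = np.where(df['day'] == 6,
    remove_outliers(df['occupied_parking_spaces']),
    df['occupied_parking_spaces']
)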
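If you would rather drop the outlier rows than overwrite values, a boolean mask built from the Sunday rows only should also work inside the same dataframe. Below is a minimal sketch under some assumptions: the sample data is made up for illustration, day 6 stands for Sunday as in the question, and the threshold is 2 standard deviations as in the question's own code. The last lines show a possible groupby variant that applies the same rule to every day separately, without splitting and concatenating dataframes.

```python
import pandas as pd

# Hypothetical sample data: day 6 (Sunday) has one extreme value (500);
# day 0 also has an extreme value (900) that must be left untouched.
df = pd.DataFrame({
    "day": [6] * 10 + [0] * 3,
    "occupied_parking_spaces": [10, 11, 9, 10, 12, 11, 10, 9, 11, 500, 40, 42, 900],
})

col = df["occupied_parking_spaces"]
is_sunday = df["day"] == 6

# Mean and standard deviation are computed from the Sunday rows only.
sunday = col[is_sunday]
is_outlier = is_sunday & ((col - sunday.mean()).abs() > 2 * sunday.std())

# Keep every row that is not a Sunday outlier.
cleaned = df[~is_outlier]

# Variant: apply the same rule within each day using groupby/transform.
grouped = df.groupby("day")["occupied_parking_spaces"]
per_day_outlier = (col - grouped.transform("mean")).abs() > 2 * grouped.transform("std")
cleaned_per_day = df[~per_day_outlier]
```

Note that with very few rows per group the 2-sigma rule can never trigger (the three day-0 rows above are all kept in the groupby variant), so check that each day has enough observations for the rule to be meaningful.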