I am trying to automate removing outliers from a Pandas dataframe using IQR as the parameter and putting the variables in a list.
This code works - (where dummy_df is the dataframe and 'pdays' is the first variable I want to remove outliers for).
q1 = np.percentile(dummy_df['pdays'], 25, interpolation = 'midpoint')
q3 = np.percentile(dummy_df['pdays'], 75, interpolation = 'midpoint')
iqr = q3 - q1
upper = np.where(dummy_df['pdays'] >= (q3+1.5*iqr))
lower = np.where(dummy_df['pdays'] <= (q1-1.5*iqr))
dummy_df.drop(upper[0], inplace = True)
dummy_df.drop(lower[0], inplace = True)
print("New Shape: ", dummy_df.shape)
However, this doesn't -
remove_outliers = ['pdays','poutcome', 'campaign', 'previous']
for outlier in remove_outliers:
q1 = np.percentile(dummy_df[outlier], 25, interpolation = 'midpoint')
q3 = np.percentile(dummy_df[outlier], 75, interpolation = 'midpoint')
iqr = q3 - q1
upper = np.where(dummy_df[outlier] >= (q3+1.5*iqr))
lower = np.where(dummy_df[outlier] <= (q1-1.5*iqr))
dummy_df.drop(upper[0], inplace = True)
dummy_df.drop(lower[0], inplace = True)
print("New Shape: ", dummy_df.shape)
The error I am getting is different datatypes. But why? Isnt it the same thing? What am I missing?
I want to be able to run a For loop since I am going to be doing trial and error on the decision tree for the best accuracy. Dont want to be writing code every time I need to drop a variable or add a variable for which I want to remove outliers.
I have tried putting the dummy_df['pdays] etc in the remove_outliers list, as well as dummy_df.pdays, etc... I have tried using loc and iloc - though I don't think that's applicable. Not sure what to do next. Important thing is, I need to understand what is the difference - what am I missing?
Based on comments on the original post, I suggest you do the following and revamp your solution.
I believe this answer provides a quick solution to your problem, so remember to search on SO before posting. This will remove all rows where one (or more) of the wanted column values is an outlier.
cols = ['pdays', 'campaign', 'previous'] # The columns you want to search for outliers in
# Calculate quantiles and IQR
Q1 = dummy_df[cols].quantile(0.25) # Same as np.percentile but maps (0,1) and not (0,100)
Q3 = dummy_df[cols].quantile(0.75)
IQR = Q3 - Q1
# Return a boolean array of the rows with (any) non-outlier column values
condition = ~((dummy_df[cols] < (Q1 - 1.5 * IQR)) | (dummy_df[cols] > (Q3 + 1.5 * IQR))).any(axis=1)
# Filter our dataframe based on condition
filtered_df = dummy_df[condition]