I'm working on a regression problem and my models are not performing well. To (maybe) get better performance, I'd like to remove some outliers.
Problem: Remove outliers from a DataFrame whose columns have different types.
The DataFrame looks like this:
df.dtypes
CONTRACT_TYPE object
CONTRACT_COC object
ORIGINATION_DATE datetime64[ns]
MATURITY_DATE datetime64[ns]
ORIGINAL_TERM float64
REMAINING_TERM int64
INTEREST_RATE_INTERNAL float64
INTEREST_RATE_FUNDING float64
However, I tried the code shown below, and also the z-score, without success, so I'm asking for some help.
# Computing IQR
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]
To summarize, I'd like the plots (scatter, boxplot) to show a more 'normal' distribution, with no outliers or as few as possible.
Please do not hesitate to ask if you need more information.
First of all, I assume that your data is normally distributed. Here is one strategy for removing outliers.
Use sklearn.preprocessing.StandardScaler on your DataFrame. It standardizes features by removing the mean and scaling to unit variance. The implementation is as simple as follows:
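import pandas as pd
from sklearn.preprocessing import StandardScaler
# x_train_df is assumed to hold only the numeric columns,
# e.g. df.select_dtypes(include="number")
# Declare Sklearn standard_scaler
standard_scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
# Fitting
standard_scaler.fit(x_train_df)
# Transforming (returns a NumPy array, so wrap it back into a DataFrame)
x_train_normal_scaled_df = pd.DataFrame(
    standard_scaler.transform(x_train_df),
    columns=x_train_df.columns, index=x_train_df.index)
# Fitting and transforming together (equivalent to the two steps above)
x_train_normal_scaled_df = pd.DataFrame(
    standard_scaler.fit_transform(x_train_df),
    columns=x_train_df.columns, index=x_train_df.index)
# Inverting the transformed data back to the original scale (NumPy array)
x_train_restored = standard_scaler.inverse_transform(x_train_normal_scaled_df)
print(x_train_normal_scaled_df.describe())
x_train_normal_scaled_df.plot()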
You should find out how much of your data is outlying. The empirical rule of the normal distribution can help here.
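As a rough illustration of the empirical rule, the sketch below measures what share of a column lies within 1, 2 and 3 standard deviations of its mean. The series `s` is synthetic, made-up data (not the asker's), and the z-score is computed by hand with pandas using the population standard deviation, which is the same scaling StandardScaler applies.

```python
import numpy as np
import pandas as pd

# Hypothetical, synthetic column drawn from a normal distribution
rng = np.random.default_rng(42)
s = pd.Series(rng.normal(loc=0.0, scale=1.0, size=100_000))

# Standardize: subtract the mean, divide by the population std (ddof=0)
z = (s - s.mean()) / s.std(ddof=0)

# Share of values within k standard deviations of the mean
coverage = {k: float((z.abs() <= k).mean()) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"within {k} std: {share:.4f}")
```

For normally distributed data the three shares should come out close to 68%, 95% and 99.7%. If a real column gives very different numbers, the normality assumption probably does not hold for it.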
In practice, I always keep the data within 3 standard deviations of the mean as my main data and treat anything outside this range as an outlier. If the data is normally distributed, this range contains about 99.73% of the values.
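A minimal sketch of that 3-standard-deviation filter on a DataFrame with mixed types might look like the following. The data is made up for illustration (only the column names are borrowed from the question), and the threshold of 3 is the choice described above, not a universal rule. Only the numeric columns are standardized; object and datetime columns are left out of the test but kept in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 ordinary rows plus one obvious outlier (last row)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CONTRACT_TYPE": ["A"] * 201,
    "INTEREST_RATE_INTERNAL": np.append(rng.normal(3.0, 0.5, 200), 50.0),
    "REMAINING_TERM": np.append(rng.integers(12, 120, 200), 60),
})

# Standardize the numeric columns only (same scaling as StandardScaler)
num = df.select_dtypes(include="number")
z = (num - num.mean()) / num.std(ddof=0)

# Keep rows where every numeric column is within 3 standard deviations
mask = (z.abs() <= 3).all(axis=1)
df_out = df[mask]

print(len(df), "rows before,", len(df_out), "rows after")
```

Note that a very large outlier inflates the standard deviation itself, so it can be worth repeating the filter, or looking at the boxplot again, after the first pass.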