pandas jupyter-notebook numeric categorical-data outliers

Remove outliers from a pandas DataFrame with mixed column types

I'm currently working on a regression problem and my models are performing poorly. To possibly improve performance, I'd like to remove some outliers.

Problem: Remove outliers from a dataframe containing different types.

The DF looks like:

   df.dtypes
CONTRACT_TYPE                           object
CONTRACT_COC                            object
ORIGINATION_DATE                datetime64[ns]
MATURITY_DATE                   datetime64[ns]
ORIGINAL_TERM                          float64
REMAINING_TERM                           int64
INTEREST_RATE_INTERNAL                 float64
INTEREST_RATE_FUNDING                  float64

However, the code shown below did not work, and neither did the z-score approach, so I'm asking for help.

# Computing IQR
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

df_out = df[~((df < (Q1 - 1.5 * IQR)) |(df > (Q3 + 1.5 * IQR))).any(axis=1)]

To summarize, I'd like the plots (scatter, boxplot) to show a more 'normal' distribution, with no outliers or as few as possible.

Please do not hesitate to ask if you need more information.

Solution

First of all, I assume that your data is approximately normally distributed. Under that assumption, here is a strategy for removing outliers.

Build a pandas DataFrame containing only the numeric features that have outliers.
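
A minimal sketch of this step, assuming a small made-up frame that mimics some of the dtypes in the question (the values are illustrative only, and `x_train_df` is just the name used later in this answer). `select_dtypes` keeps the numeric columns and leaves the object and datetime columns aside:

```python
import pandas as pd

# Hypothetical toy data mirroring the dtypes in the question.
df = pd.DataFrame({
    "CONTRACT_TYPE": ["A", "B", "A", "C"],
    "ORIGINATION_DATE": pd.to_datetime(
        ["2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"]
    ),
    "ORIGINAL_TERM": [12.0, 24.0, 36.0, 360.0],
    "REMAINING_TERM": [6, 12, 18, 300],
    "INTEREST_RATE_INTERNAL": [0.010, 0.020, 0.015, 0.250],
})

# Keep only the numeric columns; object and datetime columns are left out.
x_train_df = df.select_dtypes(include="number")

print(x_train_df.columns.tolist())
```

The original `df` is untouched, so the non-numeric columns can still be recovered later through the shared index.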

Use sklearn.preprocessing.StandardScaler on your DataFrame. It standardizes features by removing the mean and scaling to unit variance. The implementation is as simple as follows:

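import pandas as pd
from sklearn.preprocessing import StandardScaler

# Declare the scikit-learn standard scaler
standard_scaler = StandardScaler(copy=True, with_mean=True, with_std=True)

# Fitting
standard_scaler.fit(x_train_df)

# Transforming (returns a NumPy array, so wrap it back into a DataFrame)
x_train_normal_scaled_df = pd.DataFrame(
    standard_scaler.transform(x_train_df),
    columns=x_train_df.columns,
    index=x_train_df.index,
)

# Fitting and transforming together (equivalent to the two steps above)
x_train_normal_scaled_df = pd.DataFrame(
    standard_scaler.fit_transform(x_train_df),
    columns=x_train_df.columns,
    index=x_train_df.index,
)

# Inverting the transformed data back to the original scale (NumPy array)
x_train_restored = standard_scaler.inverse_transform(x_train_normal_scaled_df)

print(x_train_normal_scaled_df.describe())
x_train_normal_scaled_df.plot()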

You should find out what fraction of your data are outliers. The empirical rule of the normal distribution can help here.
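
As a quick illustration of the empirical rule, the share of a normal distribution lying within k standard deviations of the mean is erf(k / sqrt(2)). The helper name `normal_coverage` below is made up for this sketch:

```python
import math

def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} sigma: {normal_coverage(k):.2%}")
```

This gives roughly 68.27%, 95.45% and 99.73% for 1, 2 and 3 standard deviations, which is where the usual "68-95-99.7" rule comes from.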

In practice, I treat values within 3 standard deviations of the mean as the main data and anything outside that range as an outlier. If the data is normally distributed, the main data then covers about 99.73% of the observations.
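
A minimal sketch of that 3-standard-deviation filter on a frame with mixed dtypes. The data here is synthetic (200 ordinary rows plus one deliberately extreme row), and the cutoff of 3 is the rule of thumb described above, not a universal constant. The mask is computed on the numeric columns only and then applied to the full frame, so the non-numeric columns are kept:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data: 200 ordinary rows plus one extreme row, with a non-numeric column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "CONTRACT_TYPE": ["A"] * 201,
    "INTEREST_RATE_INTERNAL": np.append(rng.normal(0.02, 0.005, 200), 0.5),
    "REMAINING_TERM": np.append(rng.integers(6, 60, 200), 36),
})

# Standardize only the numeric columns; the result holds one z-score per cell.
numeric_cols = df.select_dtypes(include="number").columns
z_scores = StandardScaler().fit_transform(df[numeric_cols])

# Keep rows where every numeric feature is within 3 standard deviations of its mean.
mask = (np.abs(z_scores) <= 3).all(axis=1)
df_out = df[mask]

print(len(df), "->", len(df_out))
```

Using `.all(axis=1)` drops a row as soon as any one numeric feature is out of range. If that is too strict for your data, the mask can be built from a subset of columns instead.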