statistics, data-science, sampling, oversampling, imbalanced-data

What is the correct way to sample a highly imbalanced dataset that has low between-feature correlation and low between-class variance?


I have a dataset with 23 features that have very low correlation with one another. The two classes also have low between-class variance.

The classes are highly imbalanced, much like the data available for fraud detection. What is a suitable approach for sampling this kind of data?


Solution

  • Thanks for coming to SO to ask your question!

    Dealing with imbalanced data (defined, generally, as data where the number of cases for one or more classes is very different from the others -- a skewed distribution of some kind) is an ongoing challenge, and one that has prompted a lot of online writing. I like this article as a starting place. I'll be providing examples in Python, though similar ideas apply in R as well.

    To provide a quick summary: sampling is important for a number of reasons, not least of which is splitting your data properly for training and testing. To oversimplify, you can either change the way you draw examples from the dataset (sampling) so that each class has a roughly equal chance of being drawn, or you can simulate new examples of the class with fewer cases to reach that same balance when you do your splitting.

    For clarity, let's say there are two cases for the variable X: X = 0 and X = 1. And let's call X = 1 the case where some event happens, some characteristic is present, some response is observed, etc. We'll call that the "positive class". Finally, let's say you have 100,000 observations, with only 1,000 cases of X = 1, the rest being X = 0. Thus, your minority class is the positive class, and your imbalance (positive to negative) is 1/100.

    If you are drawing 50,000 random samples, and would like the share to be roughly 50/50 positive and negative class, you can do a couple things.
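
    If you want a concrete frame to experiment with, here is a minimal sketch that builds a dataset of roughly that shape; the variable names base_df and X and the column name 'outcome' are placeholders chosen only to match the snippets below.

    import pandas as pd
    from sklearn.datasets import make_classification
    
    # roughly 100,000 rows, 23 features, ~1% positive class
    features, labels = make_classification(n_samples=100_000,
                                           n_features=23,
                                           weights=[0.99],
                                           random_state=123)
    
    base_df = pd.DataFrame(features, columns=[f'f{i}' for i in range(23)])
    base_df['outcome'] = labels
    
    # check the imbalance: counts per class
    print(base_df['outcome'].value_counts())
    
    # the over-/under-sampling snippets below treat X as this full frame
    X = base_df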

    1. Oversample the minority class

    This method has you drawing more examples from the data where X = 1. To reach the 50/50 balance, you need to draw (randomly) more times from the positive class in order to reach 25,000 examples.

    To do this with scikit-learn, you could do something like the following.

    Assuming X is a dataframe with your data:

    from sklearn.utils import resample
    
    # make two dataframes, each with only one class 
    majority_df = X[X['outcome']==0]
    minority_df = X[X['outcome']==1]
    
    # Oversampling the minority
    oversampled_minority_df = resample(minority_df,
                              replace=True, 
                              n_samples=len(majority_df), 
                              random_state=123)
    

    A few comments:

    • "Resampling" is the processing of pulling data from a set over and over again
    • replace says that you want the process to "put back" the observation it pulled; in this case it just means that the system can grab the same observation multiple times during the resampling (as though it were being put back in a bag for someone to grab)
    • n_samples is the same length as the majority class dataframe so that the end result has an even balance between majority/minority examples
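
    Once you have the oversampled minority frame, you typically stitch it back together with the majority class and shuffle before splitting or fitting. A minimal sketch, assuming pandas and the dataframes defined above:

    import pandas as pd
    
    # put the untouched majority class and the oversampled minority back together
    balanced_df = pd.concat([majority_df, oversampled_minority_df])
    
    # shuffle the rows so the classes are interleaved, then reset the index
    balanced_df = balanced_df.sample(frac=1, random_state=123).reset_index(drop=True)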

    2. Undersample the majority class

    Now that you know about oversampling the minority class, this is just the opposite. Instead of repeatedly sampling the minority class until you have the same number of examples as the majority, here you only take as many samples of the majority class as you have examples of the minority class.

    You can reverse the above code as follows. Still assuming X is your data:

    # Undersampling the majority
    undersampled_majority_df = resample(majority_df,
                               replace=False, 
                               n_samples=len(minority_df), 
                               random_state=123)
    

    A couple of notes:

    • This is still "resampling", only taking fewer samples
    • replace is now false, since you don't want to repeat data if you don't have to (which you did have to do for oversampling)
    • n_samples now matches the length of the minority_df so that there are equal numbers of examples
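
    As with oversampling, you would then concatenate the undersampled majority with the untouched minority class. A quick sanity check on the result (again assuming pandas and the frames above) is to look at the class counts:

    import pandas as pd
    
    # combine the undersampled majority with the full minority class
    undersampled_df = pd.concat([undersampled_majority_df, minority_df])
    
    # confirm the two classes are now the same size
    print(undersampled_df['outcome'].value_counts())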

    Both over- and under-sampling classes carry statistical concerns that you can look into elsewhere.

    The other option is to synthesize data. This, again to oversimplify the statistical process, perturbs the data you have in such a way as to make "new" examples that look similar to your existing data, while introducing some (useful) noise into the process.

    One popular package for dealing with imbalanced data and synthetic data creation is imblearn. There is a lot of great work in this package on its own, but even better is how similar it is to sklearn and how well the two work together.

    imblearn provides the popular SMOTE method, or Synthetic Minority Over-sampling Technique, along with many others. In this case, however, instead of resampling your dataframes directly, imblearn applies SMOTE through a fitting process of its own.

    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split
    
    y = base_df['outcome']
    X = base_df.drop('outcome', axis=1)
    
    # setting up testing and training sets
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                        y,
                                                        test_size=0.20,
                                                        random_state=123)
    
    sm = SMOTE(random_state=123, sampling_strategy=1.0)
    # note: older imblearn releases used ratio= and fit_sample() instead
    X_train, y_train = sm.fit_resample(X_train, y_train)
    

    You'll note that the sm object has a fit_resample method (fit_sample in older imblearn releases), and that you set the desired proportion of positive to negative examples (through sampling_strategy, formerly ratio) when instantiating it. The result is a balanced training set that is usable during model fitting, while the test set is left untouched.
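
    To close the loop, here is a minimal sketch of how the resampled training set might be used: fit any scikit-learn classifier on the SMOTE-balanced training data and evaluate on the untouched, still-imbalanced test set, which keeps the evaluation honest about the original distribution. The choice of LogisticRegression is just an example.

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report
    
    # fit on the SMOTE-balanced training data
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, y_train)
    
    # evaluate on the untouched test set
    print(classification_report(y_test, clf.predict(X_test)))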