I have a dataset with 23 features that have very low correlation with one another. The two classes also show low between-class variance. The classes are highly imbalanced, much like the data typically available for fraud detection. What is a suitable approach for sampling this kind of data?
Thanks for coming to SO to ask your question!
Dealing with imbalanced data (defined, generally, as data where the number of cases for one or more classes is very different from the others -- a skewed class distribution of some kind) is an ongoing challenge, and one that has prompted a lot of online writing. I like this article as a starting place. I'll be providing examples in Python, though similar ideas apply in R as well.
To provide a quick summary: sampling matters for a number of reasons, not least of which is properly splitting your data for training and testing. To oversimplify, you can either change the way you draw examples from the dataset (sample) so that you have a roughly equal chance of getting each class, or you can simulate new examples of the class with fewer cases, again so that each class has an equal probability of being drawn when you do your splitting.
For clarity, let's say there are two cases for the variable X: X = 0 and X = 1. And let's call X = 1 the case where some event happens, some characteristic is present, some response is observed, etc. We'll call that the "positive class". Finally, let's say you have 100,000 observations, with only 1,000 cases of X = 1, the rest being X = 0. Thus, your minority class is the positive class, and your imbalance (positive to negative) is roughly 1:100.
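To make that concrete, here is a minimal sketch -- the toy dataframe and the 'outcome' column name are my own stand-ins, chosen to match the code later in this answer -- that builds data with this imbalance and checks the class shares:

import numpy as np
import pandas as pd

# Toy dataframe matching the scenario above: 100,000 rows, 1,000 positives
rng = np.random.default_rng(123)
X = pd.DataFrame({
    'feature_1': rng.normal(size=100_000),
    'outcome': [1] * 1_000 + [0] * 99_000,
})

# Check the class balance before deciding how to resample
print(X['outcome'].value_counts())
print(X['outcome'].value_counts(normalize=True))  # shares: ~0.99 vs ~0.01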
If you are drawing 50,000 random samples and would like the split to be roughly 50/50 between the positive and negative classes, you can do a couple of things.
The first option is over-sampling the minority class: you draw more examples from the data where X = 1. To reach the 50/50 balance, you need to draw (randomly, with replacement) more times from the positive class in order to reach 25,000 examples.
To do this with scikit-learn, you could do something like the following, assuming X is a dataframe with your data:
from sklearn.utils import resample

# make two dataframes, each with only one class
majority_df = X[X['outcome'] == 0]
minority_df = X[X['outcome'] == 1]

# over-sample the minority class, drawing with replacement until it
# matches the size of the majority class
oversampled_minority_df = resample(minority_df,
                                   replace=True,
                                   n_samples=len(majority_df),
                                   random_state=123)
A few comments:

- replace says that you want the process to "put back" the observation it pulled; in this case it just means that the system can grab the same observation multiple times during the resampling (as though it were being put back in a bag for someone to grab).
- n_samples is the same length as the majority class dataframe, so that the end result has an even balance between majority/minority examples (the sketch after these notes shows how to combine the two back into one balanced dataframe).
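To actually train on a balanced set, you would typically stitch the over-sampled minority back together with the untouched majority. A minimal sketch, assuming pandas is imported as pd and the dataframes from the snippet above:

# combine the untouched majority with the over-sampled minority
balanced_df = pd.concat([majority_df, oversampled_minority_df])

# shuffle the rows so the classes are interleaved, then confirm the balance
balanced_df = balanced_df.sample(frac=1, random_state=123).reset_index(drop=True)
print(balanced_df['outcome'].value_counts())  # both classes now have 99,000 rows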
The other option, under-sampling the majority class, is just the opposite. Instead of repeatedly sampling the minority class until you have the same number of examples as the majority, here you take only as many samples of the majority class as you have examples of the minority class.
You can reverse the above code as follows, still assuming X is your data:
# Under-sampling the majority down to the size of the minority
undersampled_majority_df = resample(majority_df,
                                    replace=False,
                                    n_samples=len(minority_df),
                                    random_state=123)
A couple of notes:

- replace is now False, since you don't want to repeat data if you don't have to (which you did have to do for oversampling).
- n_samples now matches the length of minority_df, so that there are equal numbers of examples.

Both over- and under-sampling carry statistical concerns that you can look into elsewhere; one practical issue with under-sampling, illustrated in the sketch below, is how much data you end up discarding.
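As with over-sampling, you would combine the two pieces before training. A minimal sketch, again assuming pandas is imported as pd:

# combine the under-sampled majority with the full minority class
balanced_small_df = pd.concat([undersampled_majority_df, minority_df])

# in the running example this leaves only 2 * 1,000 = 2,000 rows --
# the price you pay for under-sampling
print(len(balanced_small_df))
print(balanced_small_df['outcome'].value_counts())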
The other option is to synthesize data. This, again to oversimplify the statistical process, perturbs the data you have in such a way as to make "new" examples that look similar to your existing data, but introduces some (useful) noise into the process.
One popular package for dealing with imbalanced data and synthetic data creation is imblearn. There is a lot of great work in this package on its own, but even better is how similar it is to sklearn and how well the two work together.
imblearn provides the popular method SMOTE, or Synthetic Minority Over-sampling Technique, along with many others. In this case, however, instead of working directly with your dataframes, imblearn employs SMOTE as a fitting process of its own.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

y = base_df['outcome']
X = base_df.drop('outcome', axis=1)

# set up training and testing sets (resample only the training data)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    random_state=123)

# sampling_strategy=1.0 asks for a 1:1 ratio of minority to majority
sm = SMOTE(random_state=123, sampling_strategy=1.0)
X_train, y_train = sm.fit_resample(X_train, y_train)
You'll note that the sm object has a fit_resample method, and that you set the desired proportion of minority to majority (through sampling_strategy) when instantiating it. The result is a balanced training set that is ready for model fitting, while the untouched test set keeps the original class distribution.
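To close the loop, here is a minimal sketch of fitting on the SMOTE-resampled training data and evaluating on the untouched test set; LogisticRegression is purely a placeholder model of my choosing, not something prescribed by imblearn:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# fit on the balanced (resampled) training data
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# evaluate on the untouched, still-imbalanced test set; per-class precision
# and recall are more informative than raw accuracy for imbalanced problems
print(classification_report(y_test, clf.predict(X_test)))

Evaluating on data that keeps the original imbalance gives a more honest picture of how the model will behave in practice.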