Tags: r, machine-learning, imbalanced-data, smote

Create balanced dataset 1:1 using SMOTE without modifying the observations of the majority class in R


I am working on a binary classification problem with an unbalanced dataset. I want to create a new, more balanced dataset with 50% of the observations in each class. For this, I am using the SMOTE algorithm provided by the DMwR package in R.

In the new dataset, I want to keep the observations of the majority class unchanged.

However, I run into two problems:

  1. SMOTE reduces or increases the number of majority-class observations (I only want to increase the number of minority-class observations).
  2. Some observations generated by SMOTE contain NA values.

Let's assume that I have 20 observations: 17 in the majority class and only 3 in the minority class. Here is my code:

library(DMwR)
library(dplyr)

set.seed(123)  # make the random example reproducible

# 20 observations, 10 columns; X10 becomes the class label
sample_data <- data.frame(matrix(rnorm(200), nrow = 20))
sample_data[1:17, "X10"] <- 0   # majority class
sample_data[18:20, "X10"] <- 1  # minority class
sample_data[, ncol(sample_data)] <- factor(sample_data[, ncol(sample_data)], levels = c('1', '0'), labels = c('Yes', 'No'))

newDataSet <- SMOTE(X10 ~ ., sample_data, perc.over = 400, perc.under = 100)

In my code, I set perc.over = 400 to create 12 new observations of the minority class, and perc.under = 100 to leave the majority class unchanged.

However, when I check newDataSet, I see that SMOTE has reduced the majority class from 17 to 12 observations. In addition, some of the generated observations contain NA values.

[Image: screenshot of the resulting newDataSet]


Solution

  • According to ?SMOTE:

    for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created.

    Moreover:

    For instance, if 200 new examples were generated for the minority class, a value of perc.under of 100 will randomly select exactly 200 cases belonging to the majority classes from the original data set to belong to the final data set.

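    Therefore, in your case you are:

    1. creating 12 new Yes observations (3 original cases × 400/100), in addition to the 3 original ones.
    2. randomly selecting 12 No observations (100% of the 12 newly generated cases).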
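
    The arithmetic behind these two numbers can be sketched in base R. This is only an illustration of the rules quoted from ?SMOTE, using the counts from the question; the variable names are made up for the example.

```r
# Class counts from the question's example
n_min <- 3    # "Yes" observations (minority)
n_maj <- 17   # "No" observations (majority)
perc_over  <- 400
perc_under <- 100

# perc.over / 100 synthetic cases are created per original minority case
n_new <- n_min * perc_over / 100

# perc.under is a percentage of the newly generated cases,
# not of the size of the majority class
n_maj_kept <- n_new * perc_under / 100

# Class sizes in the data set returned by SMOTE
c(Yes = n_min + n_new, No = n_maj_kept)
```

    So perc.under = 100 does not mean "keep 100% of the majority class"; it ties the number of selected majority cases to the number of synthetic minority cases.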

    The new Yes observations containing NA are probably related to the k parameter of SMOTE. According to ?SMOTE:

    k: A number indicating the number of nearest neighbours that are used to generate the new examples of the minority class.

    Its default value is 5, but your original data contain only 3 Yes observations, so each of them has at most 2 minority-class neighbours. Setting k = 2 seems to solve this issue.
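
    One plausible mechanism for the NA values can be illustrated in base R. The sketch below assumes that the neighbours are chosen by indexing into a sorted vector of distances; it is not DMwR's actual code, and the distances are made-up values.

```r
# Squared distances from one minority case to all 3 minority cases
# (made-up values; the first entry is the case itself, at distance 0)
dd <- c(0, 1.2, 0.7)

# Asking for k = 5 neighbours when only 2 exist indexes past the end
# of the vector, which yields NA in R
nn_k5 <- order(dd)[2:(5 + 1)]
print(nn_k5)

# With k = 2 every requested neighbour exists
nn_k2 <- order(dd)[2:(2 + 1)]
print(nn_k2)
```

    If one of the NA neighbours is picked for interpolation, the synthetic observation built from it would be NA as well, which matches what the question reports.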

    A final comment: to achieve your goal, I would use SMOTE only to increase the number of minority-class observations (with perc.over = 400 or 500, which gives 15 or 18 Yes in total against 17 No), and then combine them with the original majority-class observations.
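
    A minimal sketch of that last suggestion, assuming the DMwR package is installed (the code is skipped otherwise). The names smoted, minority, majority and balanced are chosen for the example, and the seed is arbitrary.

```r
# Rebuild the question's example data (seed chosen arbitrarily)
set.seed(123)
sample_data <- data.frame(matrix(rnorm(200), nrow = 20))
sample_data$X10 <- factor(c(rep("No", 17), rep("Yes", 3)),
                          levels = c("Yes", "No"))

if (requireNamespace("DMwR", quietly = TRUE)) {
  # Use SMOTE only for oversampling; k = 2 because there are just 3 minority cases
  smoted <- DMwR::SMOTE(X10 ~ ., sample_data,
                        perc.over = 400, perc.under = 100, k = 2)

  # Keep the minority rows from the SMOTE output (original + synthetic) ...
  minority <- smoted[smoted$X10 == "Yes", ]
  # ... and take the majority rows from the untouched original data
  majority <- sample_data[sample_data$X10 == "No", ]

  balanced <- rbind(majority, minority)
  print(table(balanced$X10))
}
```

    With these settings the result should contain the 17 original No rows and 15 Yes rows, so the classes are close to 1:1 but not exactly equal. An exact 1:1 split would need 14 synthetic cases, which perc.over cannot produce evenly from 3 originals.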