I am working on a binary classification problem with an unbalanced dataset. I want to create a new, more balanced dataset with 50% of the observations in each class. For this, I am using the SMOTE algorithm in R, provided by the DMwR library.
In the new dataset, I want to keep the observations of the majority class unchanged.
However, I run into two problems.
Let's assume that I have 20 observations: 17 observations in the majority class and only 3 observations in the minority class. Here is my code:
library(DMwR)
library(dplyr)
sample_data <- data.frame(matrix(rnorm(200), nrow=20))
sample_data[1:17,"X10"] <- 0
sample_data[18:20,"X10"] <- 1
sample_data[,ncol(sample_data)] <- factor(sample_data[,ncol(sample_data)], levels = c('1','0'), labels = c('Yes','No'))
newDataSet <- SMOTE(X10 ~., sample_data, perc.over = 400, perc.under = 100)
In my code, I set perc.over = 400 to create 12 new observations of the minority class, and I set perc.under = 100 to leave the majority class unchanged.
However, when I check newDataSet, I see that SMOTE reduces the number of majority-class observations from 17 to 12. In addition, some of the generated observations contain NA values.
The following image shows the obtained result:
According to ?SMOTE:
for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created.
Moreover:
For instance, if 200 new examples were generated for the minority class, a value of perc.under of 100 will randomly select exactly 200 cases belonging to the majority classes from the original data set to belong to the final data set.
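The arithmetic described in these two passages can be sketched in a few lines of base R. The helper name smote_counts is hypothetical and only mirrors the wording of the help page; it does not call DMwR and is not its implementation.

```r
# Hypothetical helper: expected class counts according to the ?SMOTE wording.
smote_counts <- function(n_minority, perc.over, perc.under) {
  # perc.over/100 new examples are created per original minority case
  n_new <- n_minority * perc.over / 100
  # perc.under is a percentage of the NEWLY generated minority cases
  n_majority <- n_new * perc.under / 100
  c(new_minority      = n_new,
    total_minority    = n_minority + n_new,
    majority_selected = n_majority)
}

# The setting from the question: 3 minority cases, perc.over = 400, perc.under = 100
smote_counts(3, perc.over = 400, perc.under = 100)
# new_minority = 12, total_minority = 15, majority_selected = 12
```

The point to notice is that perc.under is relative to the number of generated cases, not to the size of the majority class, so 100 does not mean "keep the majority class as it is".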
Therefore, in your case you are creating 12 new Yes observations (besides the original ones) and randomly selecting 12 No observations.
The new Yes observations containing NA might be related to the k parameter of SMOTE. According to ?SMOTE:
k: A number indicating the number of nearest neighbours that are used to generate the new examples of the minority class.
Its default value is 5, but in your original data you have only 3 Yes observations. Setting k = 2 seems to solve this issue.
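To illustrate why a k larger than the number of available neighbours could produce NA, here is a minimal base R sketch. It is an assumption about the mechanism, not DMwR's actual source; the helper nearest and the toy coordinates are hypothetical.

```r
# Three minority points, as in the question (coordinates are arbitrary).
minority <- matrix(c(0, 0,
                     1, 0,
                     0, 2), ncol = 2, byrow = TRUE)

# Hypothetical helper: indices of the k nearest neighbours of row i,
# skipping the point itself (which sorts first, at distance 0).
nearest <- function(x, i, k) {
  d <- sqrt(rowSums(sweep(x, 2, x[i, ])^2))
  order(d)[2:(k + 1)]
}

nn5 <- nearest(minority, 1, k = 5)  # only 2 real neighbours exist, the rest are NA
nn2 <- nearest(minority, 1, k = 2)  # both neighbours are valid

# Interpolating towards a missing neighbour yields a row of NA values.
synthetic <- minority[1, ] + 0.5 * (minority[nn5[5], ] - minority[1, ])
print(nn5)
print(nn2)
print(synthetic)
```

With only 3 minority cases, each point has at most 2 neighbours, which is consistent with k = 2 avoiding the problem.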
A final comment: to achieve your goal I would use SMOTE only to increase the number of observations from the minority class (with perc.over = 400 or 500). Then, you can combine them with the original observations from the majority class.
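A possible way to do this is sketched below. The helper combine_with_majority is hypothetical, the data are rebuilt with an arbitrary seed, and the DMwR call is guarded because the package may not be installed; treat it as a starting point rather than a definitive solution.

```r
# Hypothetical helper: keep every original majority row and take only the
# minority rows (original + synthetic) from the SMOTE result.
combine_with_majority <- function(original, smoted, target, minority_label) {
  majority <- original[original[[target]] != minority_label, , drop = FALSE]
  minority <- smoted[smoted[[target]] == minority_label, , drop = FALSE]
  rbind(majority, minority)
}

# Same shape as the data in the question: 17 "No" and 3 "Yes".
set.seed(1)
sample_data <- data.frame(matrix(rnorm(200), nrow = 20))
sample_data$X10 <- factor(c(rep("No", 17), rep("Yes", 3)),
                          levels = c("Yes", "No"))

if (requireNamespace("DMwR", quietly = TRUE)) {
  # k = 2 because only 3 minority cases are available
  smoted <- DMwR::SMOTE(X10 ~ ., sample_data,
                        perc.over = 400, perc.under = 100, k = 2)
  balanced <- combine_with_majority(sample_data, smoted, "X10", "Yes")
  print(table(balanced$X10))
}
```

With perc.over = 400 this should give 17 No against 15 Yes (3 original plus 12 synthetic); perc.over = 500 would give 18 Yes instead, so the classes end up close to, but not exactly at, 50% each.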