I have the code below and need to use caret to split the data set so that 40% of the rows go into the training set and the rest into the test set. The payment variable should be distributed equally across the split. The confusionMatrix line gives this error:
"Error: data and reference should be factors with the same levels."
EDIT: payment is a binary variable, 0 (no) or 1 (yes); gdp is just numbers.
Sample dataset (I don't know how to make a table here yet):
payment gdp
0 838493
1 9303032
0 72738
1 38300022
1 283283
How can I fix this?
My code:
```r
index <- createDataPartition(y = dataset$payment, p = 0.40, list = F)
trainset <- dataset[index, ]
testset <- dataset[-index, ]
payment_knn <- train(payment ~ gdp, method = "knn", data = trainset,
                     trControl = trainControl(method = 'cv', number = 5))
predicted_outcomes <- predict(payment_knn, testset)
conMX_pay <- confusionMatrix(predicted_outcomes, testset$payment)
conMX_pay
```
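The error most likely occurs because payment is stored as a number, so train() fits a regression and predict() returns numbers, while confusionMatrix() needs two factors with the same levels. Convert payment to a factor before splitting so the training and test sets share the same levels. The code below is purely for illustration.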
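```r
library(caret)
library(dplyr)

# The outcome must be a factor for caret to treat this as classification
df <- df %>%
  mutate(payment = as.factor(payment), gdp = as.numeric(gdp))
metric <- "Accuracy"
control <- trainControl(method = "cv", number = 10)

# Set the seed before the random split so the partition is reproducible
set.seed(233)
train_set <- createDataPartition(df$payment, p = 0.8, list = FALSE)
train_me <- df[train_set, ]
valid_me <- df[-train_set, ]

# Training
fit.knn <- train(payment ~ ., method = "knn", data = train_me,
                 metric = metric, trControl = control)
# Predict on the held-out rows and compare with the true labels
validated <- predict(fit.knn, valid_me)
confusionMatrix(validated, valid_me$payment)
```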
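The error message comes from confusionMatrix() requiring both arguments to be factors with identical levels. A quick base-R sketch of that check follows. The vectors `predicted` and `reference` are made-up stand-ins for the output of predict() and for testset$payment, and the 0.5 cut-off is only an assumption for illustration; converting the outcome to a factor before training is the cleaner fix.

```r
# Hypothetical stand-ins: numeric predictions and a numeric 0/1 outcome
predicted <- c(0.2, 0.9, 0.6, 0.1)
reference <- c(0, 1, 1, 0)

# Neither vector is a factor, which is what triggers the error
is.factor(predicted)  # FALSE
is.factor(reference)  # FALSE

# Build both factors from one shared set of levels
lv <- c(0, 1)
pred_f <- factor(as.integer(predicted > 0.5), levels = lv)  # assumed 0.5 cut-off
ref_f <- factor(reference, levels = lv)

# Both checks must hold before confusionMatrix() will accept the inputs
stopifnot(is.factor(pred_f), is.factor(ref_f),
          identical(levels(pred_f), levels(ref_f)))

# Plain cross-tabulation of predictions against the truth
table(Predicted = pred_f, Actual = ref_f)
```

Passing `levels` explicitly matters when one class happens to be missing from a small test set: without it, the two factors could end up with different levels and the same error would return.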
This runs on the data in your question. It throws warnings only because the data set is so small, so treat it purely as an illustration. Data used:
payment gdp
1 0 838493
2 1 9303032
3 0 72738
4 1 38300022
5 1 283283
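For completeness, here is a sketch of the same fix applied to the question's own code with the 40% split. The 200-row `dataset` is simulated purely as a stand-in (the real data was not posted in full), and the labels "no" and "yes" are an assumption based on the 0/1 coding described in the question.

```r
library(caret)

set.seed(233)

# Simulated stand-in for the real data: binary payment, numeric gdp
n <- 200
dataset <- data.frame(
  payment = rbinom(n, 1, 0.5),
  gdp = runif(n, 5e4, 4e7)
)

# Convert the outcome to a factor BEFORE splitting and training
dataset$payment <- factor(dataset$payment, levels = c(0, 1),
                          labels = c("no", "yes"))

# Stratified split: roughly 40% of rows to training, class balance preserved
index <- createDataPartition(y = dataset$payment, p = 0.40, list = FALSE)
trainset <- dataset[index, ]
testset <- dataset[-index, ]

payment_knn <- train(payment ~ gdp, method = "knn", data = trainset,
                     trControl = trainControl(method = "cv", number = 5))

# predict() now returns a factor with the same levels as testset$payment
predicted_outcomes <- predict(payment_knn, testset)
conMX_pay <- confusionMatrix(predicted_outcomes, testset$payment,
                             positive = "yes")
conMX_pay
```

Because payment is a factor, createDataPartition() stratifies on its classes, which is what keeps the yes/no proportions similar in both sets.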
Cheers!