I am having some difficulties creating a confusion matrix to compare my model's predictions to the actual values. My data set has 159 explanatory variables, and my target variable is called "classe".
#Load libraries and data
library(caret)
library(rpart)
df <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings=c("NA","#DIV/0!",""))
#Split into training and validation
index <- createDataPartition(df$classe, times=1, p=0.5)[[1]]
training <- df[index, ]
validation <- df[-index, ]
#Model
decisionTreeModel <- rpart(classe ~ ., data=training, method="class", cp=0.5)
#Predict
pred1 <- predict(decisionTreeModel, validation)
#Check model performance
confusionMatrix(validation$classe, pred1)
The following error message is generated from the code above:
Error in confusionMatrix.default(validation$classe, pred1) :
The data must contain some levels that overlap the reference.
I think it may have something to do with the pred1 variable that the predict function returns: it's a matrix with 5 columns (one per class), while validation$classe is a factor with 5 levels. Any ideas on how to solve this?
Thanks in advance
Your predict call is returning a matrix of class probabilities (one column per class), which is rpart's default output for a classification tree. confusionMatrix needs a factor of predicted labels instead, so ask predict for the "winner" (the predicted class) by replacing your predict line with this:
pred1 <- predict(decisionTreeModel, validation, type="class")
Also note that caret's confusionMatrix(data, reference) expects the predictions as the first argument and the truth as the second, so call it as confusionMatrix(pred1, validation$classe).
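Here is a minimal, self-contained sketch of the corrected flow. It uses the built-in iris data set in place of the pml-training.csv download (an assumption for illustration; the same pattern applies to your "classe" target):

```r
library(caret)
library(rpart)

# Split into training and validation, as in your code
set.seed(42)
index <- createDataPartition(iris$Species, times=1, p=0.5)[[1]]
training   <- iris[index, ]
validation <- iris[-index, ]

# method="class" fits a classification tree; note that cp=0.5 is very
# aggressive and may prune the tree down to a single node, so the
# default cp=0.01 is usually a better starting point
fit <- rpart(Species ~ ., data=training, method="class")

# type="class" returns a factor of predicted labels rather than a
# probability matrix, which is the shape confusionMatrix expects
pred <- predict(fit, validation, type="class")

# confusionMatrix(data, reference): predictions first, truth second
confusionMatrix(pred, validation$Species)
```

With type="class", both arguments to confusionMatrix are factors over the same levels, so the "levels that overlap the reference" error goes away.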