I'm trying to build a decision tree, but this error comes up when I compute the confusion matrix on the last line:
Error : `data` and `reference` should be factors with the same levels
Here's my code:
library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)
#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)
#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
#making sure the data is in the right format
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))
#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)
So I tried this, as suggested in another topic:
confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))
But I still get an error:
Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid
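I made a toy data set and examined your code. There were a couple of issues:
1. R is easier to work with when variable names contain no spaces, so rename `Customer type` to Customer_type:
names(df) <- gsub("Customer type", "Customer_type", names(df))
2. Code Customer_type as a factor, so that the predictions and the reference passed to confusionMatrix() are both factors with the same levels:
df$Customer_type <- factor(df$Customer_type)
3. The documentation for sample.split() says the first argument, Y, should be a vector of labels. In your code you gave the variable name as a string instead. Pass the label column itself, df$Customer_type, so that there is one label per row. To see which labels occur in your variable you could use levels(df$Customer_type); in my example these are High, Med and Low.
4. Adjust the rpart() call as shown below, so the formula uses the bare column name (Customer_type ~ .) and the response is taken from train rather than from df.
With these adjustments, your code might be OK.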
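To illustrate what sample.split() does with its first argument, here is a minimal sketch. The label vector is made up for illustration (it is not the asker's data), and the name in_train is just a placeholder. The call returns one logical value per element of the vector it is given, which is why it needs one label per row.

```r
library(caTools)

set.seed(123)

# Made-up label vector: one label per row, as a data frame column would hold
labels <- factor(rep(c("High", "Med", "Low"), times = c(50, 30, 20)))

# sample.split() returns one logical value per element of its first argument
in_train <- sample.split(labels, SplitRatio = 0.70)

# The mask has the same length as the label vector, so it lines up with the rows
length(in_train) == length(labels)

# Cross-tabulate labels against the mask to inspect the per-class split
table(labels, in_train)
```

A mask built this way can be used directly in subset(df, in_train) and subset(df, !in_train).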
# toy data
df <- data.frame(
  City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
  Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
  Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
  Quantity = sample(1:10, 100, replace = TRUE),
  Total = sample(1:10, 100, replace = TRUE),
  Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by = "day"), 100),
  Rating = factor(sample(1:5, 100, replace = TRUE))
)
library(rpart)
library(caret)
library(dplyr)
library(caTools)
library(data.tree)
library(e1071)
#Splitting into training and testing data
set.seed(123)
sample = sample.split(df$Customer_type, SplitRatio = .70) # pass the label column itself (one label per row), not its name
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)
#Training the Decision Tree Classifier
tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS
#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')
#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$Customer_type)
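Before calling confusionMatrix(), it can help to check that both arguments really are factors with identical levels, since that is what the original error message complains about. The sketch below uses only base R. The vectors and the labels "Member" and "Normal" are made up for illustration; they stand in for the predictions and for the reference column of the test set.

```r
# Hypothetical stand-in for the predictions (predict(..., type = "class") gives a factor)
predicted <- factor(c("Member", "Normal", "Member"),
                    levels = c("Member", "Normal"))

# Hypothetical stand-in for the reference column, still a character vector
reference <- c("Member", "Normal", "Normal")

is.factor(reference)  # FALSE: a character reference is not a factor

# Convert the reference to a factor that uses the same levels as the predictions
reference <- factor(reference, levels = levels(predicted))

identical(levels(predicted), levels(reference))  # TRUE

# Both vectors can now be cross-tabulated or compared level by level
table(predicted, reference)
```

If the two level sets differ after the conversion, the values that became NA in reference point to labels that the model never produced.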