r, decision-tree, confusion-matrix

'factors with the same levels' in Confusion Matrix


I'm trying to build a decision tree, but this error comes up when I create the confusion matrix on the last line:

Error : `data` and `reference` should be factors with the same levels

Here's my code:

library(rpart)
library(caret)
library(dplyr)
library(rpart.plot)
library(xlsx)
library(caTools)
library(data.tree)
library(e1071)

#Loading the Excel File
library(readxl)
FINALDATA <- read_excel("Desktop/FINALDATA.xlsm")
View(FINALDATA)
df <- FINALDATA
View(df)

#Selecting the meaningful columns for prediction
#df <- select(df, City, df$`Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)
df <- select(df, City, `Customer type`, Gender, Quantity, Total, Date, Time, Payment, Rating)

#making sure the data is in the right format 
df <- mutate(df, City= as.character(City), `Customer type`= as.character(`Customer type`), Gender= as.character(Gender), Quantity= as.numeric(Quantity), Total= as.numeric(Total), Time= as.numeric(Time), Payment = as.character(Payment), Rating= as.numeric(Rating))

#Splitting into training and testing data
set.seed(123)
sample = sample.split('Customer type', SplitRatio = .70)
train = subset(df, sample==TRUE)
test = subset(df, sample == FALSE)

#Training the Decision Tree Classifier
tree <- rpart(df$`Customer type` ~., data = train)

#Predictions
tree.customertype.predicted <- predict(tree, test, type= 'class')

#confusion Matrix for evaluating the model
confusionMatrix(tree.customertype.predicted, test$`Customer type`)

So I tried this, as suggested in another topic:

confusionMatrix(table(tree.customertype.predicted, test$`Customer type`))

But I still get an error:

Error in !all.equal(nrow(data), ncol(data)) : argument type is invalid

Solution

  • I made a toy data set and examined your code. There were a few issues:

    1. R has an easier time with variable names that contain no spaces, and your 'Customer type' variable has one. I renamed it 'Customer_type'. For your data frame you could rename the column in the source file, or use names(df) <- gsub("Customer type", "Customer_type", names(df)).
    2. I coded 'Customer_type' as a factor. For you this will look like df$Customer_type <- factor(df$Customer_type). confusionMatrix() requires factors, and a character column is not one, which is what the error message is complaining about.
    3. The documentation for sample.split() says the first argument 'Y' should be a vector of labels, meaning one label per row of the data. In your code you gave the quoted variable name, which is a single string, so the result has length one instead of one TRUE/FALSE per row. Pass the column itself: sample.split(df$Customer_type, SplitRatio = .70).
    4. In the rpart() call, write the response as a bare column name (Customer_type ~ .) rather than df$`Customer type`, so that it is taken from train and not from the full data frame.

    With these adjustments, your code might be OK.
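
    As a quick check on the split step, here is a minimal sketch of what sample.split() returns for different inputs. It assumes caTools is installed, and the labels vector is made up for illustration. The function returns one TRUE/FALSE per element of its first argument, so that argument needs one label per row of the data.

    ```r
    library(caTools)

    set.seed(123)

    # Made-up labels, one per observation (100 in total)
    labels <- factor(rep(c("High", "Med", "Low"), times = c(50, 30, 20)))

    # Passing the full label vector gives one TRUE/FALSE per observation
    in_train <- sample.split(labels, SplitRatio = 0.70)
    length(in_train)          # same length as labels
    table(labels, in_train)   # each label is split in roughly the same ratio

    # Passing a single string gives a length-one result, which cannot
    # describe a row-by-row split of a 100-row data frame
    one_value <- sample.split("Customer type", SplitRatio = 0.70)
    length(one_value)
    ```

    If length(in_train) does not equal nrow(df) in your own session, the split vector was built from the wrong input.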

    # toy data
    df <- data.frame(City = factor(sample(c("Paris", "Tokyo", "Miami"), 100, replace = TRUE)),
                     Customer_type = factor(sample(c("High", "Med", "Low"), 100, replace = TRUE)),
                     Gender = factor(sample(c("Female", "Male"), 100, replace = TRUE)),
                     Quantity = sample(1:10, 100, replace = TRUE),
                     Total = sample(1:10, 100, replace = TRUE),
                     Date = sample(seq(as.Date('2020/01/01'), as.Date('2020/12/31'), by="day"), 100),
                     Rating = factor(sample(1:5, 100, replace = TRUE)))
    
    library(rpart)
    library(caret)
    library(dplyr)
    library(caTools)
    library(data.tree)
    library(e1071)
    
    #Splitting into training and testing data
    set.seed(123)
    sample = sample.split(df$Customer_type, SplitRatio = .70) # PASS THE LABEL COLUMN ITSELF: ONE LABEL PER ROW
    train = subset(df, sample == TRUE)
    test = subset(df, sample == FALSE)
    
    #Training the Decision Tree Classifier
    tree <- rpart(Customer_type ~., data = train) # ADJUST YOUR CODE SO IT'S LIKE THIS
    
    #Predictions
    tree.customertype.predicted <- predict(tree, test, type= 'class')
    
    #confusion Matrix for evaluating the model
    confusionMatrix(tree.customertype.predicted, test$Customer_type)
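
    On the original data, where `Customer type` was converted with as.character(), the error message is raised because confusionMatrix() wants two factors that share the same levels, and a character column has no levels at all. Below is a minimal base-R sketch of that mismatch and one way to align the levels. The values "Member" and "Normal" are made-up stand-ins, not taken from the question's file.

    ```r
    # Predictions from predict(..., type = "class") come back as a factor
    predicted <- factor(c("Member", "Normal", "Member", "Member"),
                        levels = c("Member", "Normal"))

    # A column converted with as.character() is not a factor
    reference_chr <- c("Member", "Normal", "Normal", "Member")
    is.factor(reference_chr)   # FALSE

    # Rebuild the reference as a factor using the prediction's levels
    reference <- factor(reference_chr, levels = levels(predicted))
    identical(levels(predicted), levels(reference))   # TRUE

    # With shared levels the cross-tabulation is square (2 x 2 here)
    cm <- table(Predicted = predicted, Reference = reference)
    dim(cm)
    ```

    With the two factors aligned like this, confusionMatrix(predicted, reference) receives the kind of input it checks for. The second error in the question, from the table() variant, appears to come from a table whose row and column counts differ, which shared levels also avoid.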