Tags: r, cross-validation, knn, confusion-matrix

Trying to create confusion matrix from cross-validated results using the best value of k in R


I have written the knn cross-validation method below using the iris dataset in R. How would I get the best value of k from this and create a confusion matrix based on it? Any help would be great.

library(class)
data("iris")
kfolds = 5
iris$folds = cut(seq(1,nrow(iris)),breaks=kfolds,labels=FALSE)
iris$folds

# Sets the columns to use as predictors.
pred = c("Petal.Width", "Petal.Length")
accuracies = c()
ks = c(1,3,5,7,9,11,13,15)
for (k in ks) {
  k.accuracies = c()
  for(i in 1:kfolds) {
    # Builds the training set and validation set for this fold.
    train.items.this.fold  = iris[iris$folds != i,] 
    validation.items.this.fold = iris[iris$folds == i,]

    # Fit knn model on this fold.
    predictions = knn(train.items.this.fold[,pred], 
                      validation.items.this.fold[,pred], 
                      train.items.this.fold$Species, k=k)

    predictions.table <- table(predictions, validation.items.this.fold$Species)

    # Work out the number of correct predictions.
    correct.list <- predictions == validation.items.this.fold$Species
    nr.correct = nrow(validation.items.this.fold[correct.list,])

    # Get accuracy rate of cv.
    accuracy.rate = nr.correct/nrow(validation.items.this.fold)

    # Adds the fold accuracy to the list for this k.
    k.accuracies <- cbind(k.accuracies, accuracy.rate)
  }
  # Adds the mean accuracy to the total accuracy list.
  accuracies <- cbind(accuracies, mean(k.accuracies))
}

# Accuracy for each value of k: visualisation.
accuracies
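
One way to get the visualisation the comment above refers to is a simple base-graphics plot of mean accuracy against k (accuracies is built with cbind, so it is a 1-row matrix and is coerced to a plain vector first):

plot(ks, as.numeric(accuracies), type = "b",
     xlab = "k", ylab = "Mean CV accuracy")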

Update:

I tried the following, but it compares the predicted species labels with the value of k itself, so it does not produce the confusion matrix:

predictions.table <- table(predictions == ks[which.max(accuracies)], validation.items.this.fold$Species)

Solution

  • Your code has a few problems; this version runs:

    library(class)
    data("iris")
    kfolds = 5
    iris$folds = cut(seq(1,nrow(iris)),breaks=kfolds,labels=FALSE)
    
    # Sets the columns to use as predictors.
    pred = c("Petal.Width", "Petal.Length")
    ks = c(1,3,5,7,9,11,13,15)
    accuracies = c()
    # One entry per value of k: that k's out-of-fold predictions.
    predictions.list = list()
    for (k in ks) {
      k.accuracies = c()
      k.predictions = c()
      for(i in 1:kfolds) {
        # Builds the training set and validation set for this fold.
        train.items.this.fold  = iris[iris$folds != i,] 
        validation.items.this.fold = iris[iris$folds == i,]
    
        # Fits the knn model on this fold.
        predictions = knn(train.items.this.fold[,pred], 
                          validation.items.this.fold[,pred], 
                          train.items.this.fold$Species, k=k)
        # Folds are contiguous row blocks taken in order, so the
        # concatenated predictions line up with iris$Species.
        k.predictions = c(k.predictions, as.character(predictions))
    
        # Confusion matrix for this fold (predicted vs. actual species).
        predictions.table <- table(predictions, validation.items.this.fold$Species)
    
        # Accuracy on this fold: the proportion of correct predictions.
        correct.list <- predictions == validation.items.this.fold$Species
        k.accuracies <- c(k.accuracies, mean(correct.list))
      }
      # Stores this k's out-of-fold predictions and its mean accuracy.
      predictions.list[[length(predictions.list) + 1]] = k.predictions
      accuracies <- c(accuracies, mean(k.accuracies))
    }
    accuracies
    
    
    # Confusion matrix for the best k, built from its out-of-fold
    # predictions over the whole dataset.
    predictions.table <- table(predictions.list[[which.max(accuracies)]], iris$Species)
    

    When you call predictions.table <- table(predictions, validation.items.this.fold$Species), that table is the confusion matrix for a single fold. Since you are using accuracy as the evaluation metric, the best k is the one with the highest mean accuracy. You can get the best k value like this:

    ks[which.max(accuracies)]
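
    If the caret package is available, its confusionMatrix() function wraps this same table and adds overall accuracy and per-class statistics. A minimal sketch, assuming caret is installed and the loop above has been run:

    # Optional convenience check with caret (not required for the answer).
    library(caret)
    best.preds = factor(predictions.list[[which.max(accuracies)]],
                        levels = levels(iris$Species))
    confusionMatrix(best.preds, iris$Species)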
    

    UPDATE

    I created a list to store the predictions for each value of k, and then built the confusion matrix from the predictions of the k with the best accuracy.
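
    As a quick usage check (a sketch, assuming the corrected code above has run), you can line each k up with its mean cross-validated accuracy before printing the confusion matrix:

    data.frame(k = ks, mean.accuracy = accuracies)
    ks[which.max(accuracies)]   # best k by mean CV accuracy
    predictions.table           # confusion matrix for that k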