
How can I perform bootstrap to find the confidence interval for a k-nn model in R?


I have a training data frame with two columns, like this:

   a     b
1  1000  20
2  1008  13
...
n  ...   ...

I need a 95% CI for the estimate of 'b' at a specific value of 'a', using a 'k' of my choice, and then I need to compare that CI with the ones obtained for other values of 'k'. How can I bootstrap this with 1000 replicates? I am required to use a k-NN model fitted to the training data with kernel = 'gaussian', and k must be in the range 1-20. I have found that the best k for this model is k = 5 and tried the bootstrap below, but it doesn't work.

library(kknn)
library(boot)

boot.kn = function(formula, data, indices)
{
  # Create a bootstrapped version
  d = data[indices,]
  
  # Fit a model for bs
  fit.kn =  fitted(train.kknn(formula,data, kernel= "gaussian", ks = 5))
  
  # Do I even need this complicated block
  target = as.character(fit.kn$terms[[2]])
  rv = my.pred.stats(fit.kn, d[,target])
  return(rv)
}
bs = boot(data=df, statistic=boot.kn, R=1000, formula=b ~ a)
boot.ci(bs,conf=0.95,type="bca")

Please let me know if anything is unclear and I will add more information. Thank you.


Solution

  • Here is a way to regress b on a with the k-nearest neighbors algorithm.

    First, a data set: the first two columns of the built-in iris data. The one row with Sepal.Length == 5.3 is held out so it can serve as the new data later.

    i <- which(iris$Sepal.Length == 5.3)
    df1 <- iris[-i, 1:2]
    newdata <- iris[i, 1:2]
    names(df1) <- c("a", "b")
    names(newdata) <- c("a", "b")
    

    Now load the packages to be used and determine the optimal value for k with package kknn.

    library(caret)
    library(kknn)
    library(boot)
    
    fit <- kknn::train.kknn(
      formula = b ~ a,
      data = df1,
      kmax = 15,
      kernel = "gaussian",
      distance = 1
    )
    k <- fit$best.parameters$k
    k
    #[1] 9
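
    As a quick check before bootstrapping, the tuned model can give the point prediction at the held-out value of a. The sketch below repeats the setup so that it runs on its own. It relies on the predict() method that kknn provides for train.kknn objects, which, as far as I understand, uses the best kernel and k found during tuning. The name pred is only for illustration.

    ```r
    library(kknn)

    # Same data as above: hold out the iris row with Sepal.Length == 5.3.
    i <- which(iris$Sepal.Length == 5.3)
    df1 <- setNames(iris[-i, 1:2], c("a", "b"))
    newdata <- setNames(iris[i, 1:2], c("a", "b"))

    # Tune k by leave-one-out cross-validation, as in the answer.
    fit <- train.kknn(b ~ a, data = df1, kmax = 15,
                      kernel = "gaussian", distance = 1)

    # Point prediction of b at a = 5.3 from the tuned model.
    pred <- predict(fit, newdata)
    pred
    ```

    This single number is what the bootstrap then puts an interval around.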
    

    Now bootstrap the prediction at the new point a = 5.3.

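    # Statistic for boot(): refit on the resampled rows and predict b at newdata.
    boot.kn <- function(data, indices, formula, newdata, k){
      d <- data[indices, ]
      # k is not forwarded here, so caret::knnreg() falls back to its default k = 5;
      # add k = k to the call to use the tuned value (the interval below would then change).
      fit <- knnreg(formula, data = d)
      predict(fit, newdata = newdata)
    }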
    
    set.seed(2021)
    R <- 1e4
    bs <- boot(df1, boot.kn, R = R, formula = b ~ a, newdata = newdata, k = k)
    ci <- boot.ci(bs, conf = 0.95, type = "bca")
    
    ci
    #BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    #Based on 10000 bootstrap replicates
    #
    #CALL : 
    #boot.ci(boot.out = bs, conf = 0.95, type = "bca")
    #
    #Intervals : 
    #Level       BCa          
    #95%   ( 3.177,  3.740 )  
    #Calculations and Intervals on Original Scale
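
    The question asks for a Gaussian kernel and for a comparison across several values of k, whereas caret::knnreg() fits an unweighted k-NN. Here is a minimal sketch of one way to cover both, assuming kknn::kknn() is an acceptable stand-in for the fitted model. It refits a Gaussian-kernel k-NN on each resample and repeats the bootstrap for a few illustrative values of k. The helper names boot_kknn and ci_for_k, the chosen k values, and the percentile interval (used instead of BCa to keep the run short) are my assumptions, not part of the code above.

    ```r
    library(kknn)
    library(boot)

    # Same data as above: hold out the iris row with Sepal.Length == 5.3.
    i <- which(iris$Sepal.Length == 5.3)
    df1 <- setNames(iris[-i, 1:2], c("a", "b"))
    newdata <- setNames(iris[i, 1:2], c("a", "b"))

    # Statistic for boot(): Gaussian-kernel k-NN refit on the resample,
    # returning the predicted b at the new point.
    boot_kknn <- function(data, indices, formula, newdata, k) {
      d <- data[indices, ]
      fit <- kknn(formula, train = d, test = newdata, k = k, kernel = "gaussian")
      fit$fitted.values
    }

    # Percentile 95% interval for one value of k (hypothetical helper).
    ci_for_k <- function(k, R = 1000) {
      bs <- boot(df1, boot_kknn, R = R, formula = b ~ a, newdata = newdata, k = k)
      ci <- boot.ci(bs, conf = 0.95, type = "perc")
      c(k = k, estimate = bs$t0, lower = ci$percent[4], upper = ci$percent[5])
    }

    set.seed(2021)
    ks <- c(3, 5, 9, 20)          # illustrative values within the allowed 1-20 range
    res <- t(sapply(ks, ci_for_k))
    res
    ```

    Each row of res gives k, the prediction on the original data, and the interval bounds, so the intervals can be compared side by side.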
    

    Plot the results.

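    # Two stacked panels; keep the old graphical parameters to restore them afterwards.
    old_par <- par(mfrow = c(2, 1),
                   oma = c(5, 4, 0, 0) + 0.1,
                   mar = c(1, 1, 1, 1) + 0.1)
    
    # Top panel: bootstrap distribution. Red = observed b of the held-out row (3.7),
    # blue = bootstrap mean, dashed blue = BCa interval limits.
    hist(bs$t, main = "Histogram of bootstrap values")
    abline(v = 3.7, col = "red")
    abline(v = mean(bs$t), col = "blue")
    abline(v = ci$bca[4:5], col = "blue", lty = "dashed")
    
    # Bottom panel: training data with the observed point (red), the bootstrap
    # mean (blue) and the interval drawn as an error bar at a = 5.3.
    plot(b ~ a, df1)
    points(5.3, 3.7, col = "red", pch = 19)
    points(5.3, mean(bs$t), col = "blue", pch = 19)
    arrows(x0 = 5.3, y0 = ci$bca[4],
           x1 = 5.3, y1 = ci$bca[5],
           col = "blue", angle = 90, code = 3)
    
    # Restore the previous graphical parameters.
    par(old_par)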
    
