
Why does the `class::knn()` function give different results from `kknn::kknn()` with a fixed k?


I am trying to convert the base R code in Introduction to Statistical Learning into the R tidymodels ecosystem. The book uses class::knn() while tidymodels uses kknn::kknn(), and I got different results when doing KNN with a fixed k. So I stripped out tidymodels and compared class::knn() and kknn::kknn() directly, and I still got different results. class::knn() uses Euclidean distance, and kknn::kknn() uses Minkowski distance with a distance parameter of 2, which according to Wikipedia is Euclidean distance. I also set the kernel in kknn() to "rectangular", which according to the documentation means unweighted. Shouldn't the results of KNN modeling with a fixed k be the same?
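
Side note, as a sanity check (this snippet is mine, not from the book or either package): base R's dist() agrees that Minkowski distance with p = 2 is just Euclidean distance.

x <- matrix(c(0, 0,
              3, 4), nrow = 2, byrow = TRUE)

# the classic 3-4-5 triangle: both methods give a distance of 5
dist(x, method = "minkowski", p = 2)
#>   1
#> 2 5
dist(x, method = "euclidean")
#>   1
#> 2 5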

Here is (basically) the base R code with class::knn() from the book:

library(ISLR2)

# base R class: train on years before 2005, test on 2005
train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
Direction.2005 <- Smarket$Direction[!train]

# predictors are the two lag variables
train.X <- cbind(Smarket$Lag1, Smarket$Lag2)[train, ]
test.X <- cbind(Smarket$Lag1, Smarket$Lag2)[!train, ]
train.Direction <- Smarket$Direction[train]

the_k <- 3 # 30 shows larger discrepancies

library(class)
knn.pred <- knn(train.X, test.X, train.Direction, k = the_k)

Here is my tidyverse code with kknn::kknn():

# tidyverse kknn
library(tidyverse)
Smarket_train <- Smarket %>%
  filter(Year != 2005)

Smarket_test <- Smarket %>%  # Smarket.2005
  filter(Year == 2005)

library(kknn)
the_knn <- 
  kknn(
    Direction ~ Lag1 + Lag2, Smarket_train, Smarket_test, k = the_k,
    distance = 2, kernel = "rectangular"
  )

fit <- fitted(the_knn)

This shows the differences:

the_k
# class
table(Direction.2005, knn.pred)
# kknn
table(Smarket_test$Direction, fit)

Did I make a stupid mistake in the coding? If not, can anybody explain the differences between class::knn() and kknn::kknn()?


Solution

  • Alright, there is a lot going on in this one. First, we see from the documentation of class::knn() that the classification is decided by majority vote, with ties broken at random. So it appears we should start by looking at the output of class::knn() to see what happens.

    I repeatedly compared two fresh runs of class::knn() against each other with

    which(
      knn(train.X, test.X, train.Direction, k = the_k) !=
        knn(train.X, test.X, train.Direction, k = the_k)
    )
    

    and after a while, I got 28 and 66. So these are the observations in the test data set that have some randomness in them. To see why these two observations are troublesome, we can set prob = TRUE in class::knn() to get the proportion of votes for the winning class.

    knn.pred <- knn(train.X, test.X, train.Direction, k = the_k, prob = TRUE)
    
    attr(knn.pred, "prob")
    #>   [1] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000 0.6666667
    #>   [8] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000
    #>  [15] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667
    #>  [22] 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.6666667 0.5000000
    #>  [29] 0.6666667 0.6666667 1.0000000 0.6666667 0.6666667 0.6666667 0.6666667
    #>  [36] 1.0000000 0.6666667 0.6666667 0.6666667 1.0000000 1.0000000 1.0000000
    #>  [43] 0.6666667 0.6666667 0.6666667 0.6666667 1.0000000 0.6666667 1.0000000
    #>  [50] 1.0000000 0.6666667 1.0000000 0.6666667 0.6666667 1.0000000 1.0000000
    #>  [57] 0.6666667 0.6666667 0.6666667 1.0000000 0.6666667 0.6666667 0.6666667
    #>  [64] 0.6666667 1.0000000 0.5000000 0.6666667 1.0000000 0.6666667 1.0000000
    ...
    

    And here we see that the reported proportion for observations 28 and 66 is 0.5 in both cases. But how can that be when we have k = 3?
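
    As a quick sanity check (my own aside), note that with three voting neighbors the winning class must hold 2 or 3 of the votes, so the reported proportion can only take two values:

    c(2, 3) / 3  # possible winning proportions with k = 3
    #> [1] 0.6666667 1.0000000

    A proportion of 0.5 therefore means that more than 3 neighbors must have taken part in the vote.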

    To see how that can happen, we will take a look at the nearest neighbors of these points. I'm going to use the RANN::nn2() function to calculate the distances between the training set and the testing set. Let us look at the first observation as an example: we calculate the distances and pull them out.

    dists <- RANN::nn2(train.X, test.X)
    
    dists$nn.dists[1, ]
    #>  [1] 0.01063015 0.05632051 0.06985700 0.08469357 0.08495881 0.08561542
    #>  [7] 0.10823123 0.12003333 0.12621014 0.12657014
    

    The distances by themselves don't tell us much; what we want to know is which observations in the training set they correspond to, and what their classes are.

    We can pull this out with $nn.idx

    dists$nn.idx[1, ]
    #>  [1] 503 411 166 964 981 611 840 705 562 578
    
    train.Direction[dists$nn.idx[1, 1:3]]
    #> [1] Up   Down Down
    #> Levels: Down Up
    

    And we see here that the three nearest neighbors of the first observation are Up, Down, and Down, giving a classification of Down.

    If we look at the 66th observation, we see something different. Notice how the 3rd and 4th nearest neighbors have exactly the same distance?

    dists$nn.dists[66, ]
    #>  [1] 0.06500000 0.06754258 0.07465253 0.07465253 0.07746612 0.07778175
    #>  [7] 0.08905055 0.09651943 0.11036757 0.11928118
    train.Direction[dists$nn.idx[66, 1:4]]
    #> [1] Down Down Up   Up  
    #> Levels: Down Up
    

    And when we look at their classes, there are 2 Down and 2 Up. This is where the discrepancy comes in: class::knn() counts all 4 of these observations as the "3 nearest neighbors", which produces a tie that is then broken at random. kknn::kknn() takes only the first 3 neighbors, disregarding the tie in distances, and always predicts Down, since the first 3 neighbors are 2 Down and 1 Up.

    predict(the_knn, type = "prob")[66, , drop = FALSE]
    #>           Down        Up
    #> [1,] 0.6666667 0.3333333
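
    To make the two tie-handling strategies concrete, here is a minimal sketch (my own, reusing dists and train.Direction from above) that reproduces both behaviors by hand for observation 66:

    # classes of the 4 closest training points to test observation 66
    nn_classes <- train.Direction[dists$nn.idx[66, 1:4]]
    nn_classes
    #> [1] Down Down Up   Up  
    #> Levels: Down Up

    # kknn-style: take exactly the first k = 3 neighbors, ignoring the
    # distance tie between the 3rd and 4th
    table(nn_classes[1:3])  # 2 Down, 1 Up -> always predicts Down

    # class::knn-style: the tie pulls in the 4th neighbor as well,
    # giving a 2-2 vote that is broken at random
    table(nn_classes)       # 2 Down, 2 Up -> prediction flips between runs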