How to use knn classification (class package) using training and test datasets

Df_census is the original data frame. I am trying to use Sex, EducYears and Age to predict whether a person's Income is "<=50K" or ">50K".

There are 20,000 rows in x_train_auto (training set) and 12,561 in x_test_auto (test set).

My classification variable (training set) has 15,124 "<=50K" and 4,876 ">50K" observations.

Here is my code:

predictions = knn(train = x_train_auto, # response
                  test  = x_test_auto, # response
                  cl = Df_census$Income[in_train_census], # prediction
                  k = 25)

table(predictions)
#<=50K 
#12561

As you can see, all 12,561 test samples were predicted to have an Income of "<=50K".

This doesn't make sense. I am not sure where I am going wrong.

P.S.: I have Sex one-hot encoded as 0 for male and 1 for female. I scaled EducYears and Age, and then added the one-hot encoded Sex variable back into the scaled training and test data.

Solution

identifying the problem

The x_test_auto.csv data you provided suggests that you passed logical vectors of TRUEs and FALSEs (which define the indices of the training and test samples rather than the actual data) to the train and test arguments of class::knn.

the solution

Instead, use the logical vector stored in x_train_auto (which I believe corresponds to in_train_census in your example) to define two separate data.frames, each containing all your desired predictors. These are then the training set and the test set.

p <- c("Age","EducYears","Sex")
Df_train <- Df_census[in_train_census,p]
Df_test <- Df_census[!in_train_census,p]

In the knn function, pass the training set to the train argument, and the test set to the test argument, and further pass the outcome / target variable of the training set (as a factor) to cl.

The output (see ?class::knn) will be the predicted outcome for the test set.
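Before calling knn, it can help to check that train and test really hold predictor values and not the index vector. The following is a minimal sketch on simulated data; the toy data frame, its values, and the names train_x, test_x and train_y are made up for illustration and are not your census data.

```r
library(class)

set.seed(1)
# Hypothetical toy data standing in for Df_census (simulated values)
n <- 200
toy <- data.frame(
  Age       = round(runif(n, 18, 70)),
  EducYears = sample(5:16, n, replace = TRUE),
  Sex       = rbinom(n, 1, 0.5)
)
toy$Income <- factor(
  ifelse(toy$EducYears + toy$Age / 10 + rnorm(n) > 15, ">50K", "<=50K"),
  levels = c("<=50K", ">50K")
)

in_train <- rep(c(TRUE, FALSE), length.out = n)  # logical index, NOT the data
p <- c("Age", "EducYears", "Sex")

train_x <- toy[in_train, p]       # predictor values of the training rows
test_x  <- toy[!in_train, p]      # predictor values of the test rows
train_y <- toy$Income[in_train]   # outcome of the training rows

# sanity checks: numeric predictors, same columns, one label per training row
stopifnot(
  all(sapply(train_x, is.numeric)),
  identical(names(train_x), names(test_x)),
  length(train_y) == nrow(train_x),
  is.factor(train_y)
)

pred <- knn(train = train_x, test = test_x, cl = train_y, k = 5)
table(pred)
```

If one of the checks fails (for example because train is a logical vector), stopifnot stops with an error before knn is ever called.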

Here is a complete and reproducible workflow using your data.

the data

library(class)

# read data from Dropbox
x_train_auto <- read.csv("https://dropbox.com/s/6kupkp4u4qyizy7/x_test_auto.csv?dl=1", row.names = 1)
Df_census <- read.csv("https://dropbox.com/s/ccvck8ajnatmpv0/Df_census.csv?dl=1", row.names = 1, stringsAsFactors = TRUE)

table(x_train_auto) # TRUE are training, FALSE are test set
#> x_train_auto
#> FALSE  TRUE 
#> 12561 20000
str(Df_census) # Income as factor, Sex is binary, Age and EducYears are numeric
#> 'data.frame':    32561 obs. of  15 variables:
#>  $ Age          : int  39 50 38 53 28 37 49 52 31 42 ...
#>  $ Work         : Factor w/ 9 levels "?","Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
#>  $ Fnlwgt       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
#>  $ Education    : Factor w/ 16 levels "10th","11th",..: 10 10 12 2 10 13 7 12 13 10 ...
#>  $ EducYears    : int  13 13 9 7 13 14 5 9 14 13 ...
#>  $ MaritalStatus: Factor w/ 7 levels "Divorced","Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
#>  $ Occupation   : Factor w/ 15 levels "?","Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
#>  $ Relationship : Factor w/ 6 levels "Husband","Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
#>  $ Race         : Factor w/ 5 levels "Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
#>  $ Sex          : int  1 1 1 1 0 0 0 1 0 1 ...
#>  $ CapitalGain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
#>  $ CapitalLoss  : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ HoursPerWeek : int  40 13 40 40 40 40 16 45 50 40 ...
#>  $ NativeCountry: Factor w/ 42 levels "?","Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
#>  $ Income       : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...

# predictors and response
p <- c("Age","EducYears","Sex")
y <- "Income"

# create data partition
in_train_census <- x_train_auto$x

Df_train <- Df_census[in_train_census,]
Df_test <- Df_census[!in_train_census,]

# check
dim(Df_train)
#> [1] 20000    15

dim(Df_test)
#> [1] 12561    15

table(Df_train$Income)
#> 
#> <=50K  >50K 
#> 15124  4876

using class::knn

The knn (k-nearest-neighbors) algorithm can perform better or worse depending on the choice of the hyperparameter k. It's often difficult to know which k value is best for the classification of a particular dataset. In a machine learning setting, you'd want to try out different values of k to find a value that gives the highest performance on your test dataset (i.e., data which was not used for model fitting).

It's always important to strike a good balance between overfitting (the model is too complex and gives good results on the training data, but less accurate or even useless results on new test data) and underfitting (the model is too simple to capture the actual patterns in the data). In knn, a larger k generally guards better against overfitting, because each prediction is a majority vote over more neighbours and is therefore less sensitive to individual noisy training points.
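As a side illustration of the overfitting point, here is a small sketch on simulated data (not the census data) that compares training and test accuracy for a very small and a larger k. The helper knn_accuracy and all values are hypothetical.

```r
library(class)

set.seed(11)
# Hypothetical simulated data: two noisy numeric predictors
n <- 400
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(x$x1 + x$x2 + rnorm(n) > 0, ">50K", "<=50K"),
            levels = c("<=50K", ">50K"))
in_train <- seq_len(n) <= 200

# accuracy of knn fitted on the training rows, evaluated on (new_x, new_y)
knn_accuracy <- function(k, new_x, new_y) {
  pred <- knn(train = x[in_train, ], test = new_x, cl = y[in_train], k = k)
  mean(pred == new_y)
}

k_values <- c(1, 25)
res <- data.frame(
  k     = k_values,
  train = sapply(k_values, knn_accuracy, new_x = x[in_train, ],  new_y = y[in_train]),
  test  = sapply(k_values, knn_accuracy, new_x = x[!in_train, ], new_y = y[!in_train])
)
res
```

With k = 1, every training point is its own nearest neighbour, so the training accuracy is 1 by construction. That says nothing about how well the model does on new data, which is why the accuracy has to be measured on rows that were not used for fitting.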

# apply knn for various k using the given training / test set
r <- data.frame(array(NA, dim = c(0, 2), dimnames = list(NULL, c("k","accuracy"))))

for (k in 1:30) {
  
  #cat("k =", k, "\n")
  
  # fit model on training set, predict test set data
  set.seed(60402) # to be reproducible
  predictions <- knn(train = Df_train[,p],
                     test = Df_test[,p],
                     cl = Df_train[,y],
                     k = k)
  
  # confusion matrix on test set
  t <- table(pred = predictions, ref = Df_test[,y])
  
  # accuracy
  a <- sum(diag(t)) / sum(t)
  
  # bind
  r <- rbind(r, data.frame(k = k, accuracy = a))
}

visualize model assessment

# find best k
r[which.max(r$accuracy),]
#>     k  accuracy
#> 17 17 0.8007324

(k.best <- r[which.max(r$accuracy),"k"])
#> [1] 17

# plot
with(r, plot(k, accuracy, type = "l"))
abline(v = k.best, lty = 2)

Created on 2021-09-23 by the reprex package (v2.0.1)

interpretation

The loop results suggest that your optimal value of k for this particular training and test set is between 12 and 17 (see plot above), but the accuracy gain is very small compared to using k = 1 (it's at around 80% regardless of k).

additional thoughts

Because the ">50K" class is much rarer than the "<=50K" class (4,876 of 20,000 training rows, about 24%), accuracy might not be the best performance metric: a model that predicts "<=50K" for almost everyone can still look fairly accurate. Sensitivity might be equally or more important, and you could modify the example code to calculate and assess other performance metrics instead.
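Other metrics can be read off the same kind of confusion matrix that the loop already builds. The sketch below uses a handful of made-up predictions (not results from the census data) and assumes that ">50K" is treated as the positive class.

```r
# Hypothetical predictions and reference labels, for illustration only
lev  <- c("<=50K", ">50K")
ref  <- factor(c(rep("<=50K", 8), rep(">50K", 4)), levels = lev)
pred <- factor(c(rep("<=50K", 7), ">50K", "<=50K", "<=50K", "<=50K", ">50K"),
               levels = lev)

cm <- table(pred = pred, ref = ref)

# assumption: ">50K" is the positive class
tp <- cm[">50K", ">50K"]    # high income, predicted high
fn <- cm["<=50K", ">50K"]   # high income, predicted low
tn <- cm["<=50K", "<=50K"]  # low income, predicted low
fp <- cm[">50K", "<=50K"]   # low income, predicted high

accuracy          <- (tp + tn) / sum(cm)
sensitivity       <- tp / (tp + fn)
specificity       <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, balanced_accuracy = balanced_accuracy), 3)
```

In this toy case, accuracy is 8/12 while sensitivity is only 1/4, which is exactly the kind of gap that accuracy alone hides when one class dominates.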

In addition to pure prediction, you might want to explore whether other variables could be informative predictors of the Income class, by adding them to the p vector and comparing the resulting accuracies.
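If you add predictors with very different ranges (CapitalGain next to Age, say), the distances that knn computes are dominated by the variable with the largest range, so scaling, as you describe in your question, becomes more important. One common approach is to scale both sets with the means and standard deviations of the training set only. A minimal sketch on simulated values (the data and the names center and spread are hypothetical):

```r
# Hypothetical toy predictors (simulated, not the census data)
set.seed(3)
train_x <- data.frame(Age = runif(100, 18, 70), CapitalGain = rexp(100, 1 / 1000))
test_x  <- data.frame(Age = runif(40, 18, 70),  CapitalGain = rexp(40, 1 / 1000))

# scale the training set and keep the parameters that were used
train_scaled <- scale(train_x)
center <- attr(train_scaled, "scaled:center")  # training means
spread <- attr(train_scaled, "scaled:scale")   # training standard deviations

# apply the SAME parameters to the test set
test_scaled <- scale(test_x, center = center, scale = spread)

round(colMeans(train_scaled), 3)
round(colMeans(test_scaled), 3)
```

The scaled training columns have mean 0 and standard deviation 1. The test columns will only be close to that, which is expected because no information from the test set went into the scaling.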

Here, we base our conclusions on one particular realization of training and test data. Better machine learning practice would be to split your data in two (as here), but then repeatedly split the training set again to fit and assess many more models, for example with (repeated) k-fold cross-validation, and to use the test set only for the final assessment. Good R packages for this are, for example, caret and tidymodels.
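To show the idea without extra packages, here is a rough hand-written 5-fold cross-validation around class::knn. It is only a sketch on simulated data; caret or tidymodels do this more conveniently, and the names fold_id, k_grid and cv_result are made up for this example.

```r
library(class)

set.seed(42)
# Hypothetical toy training data standing in for Df_train[, p] and Df_train$Income
n <- 300
x <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y <- factor(ifelse(x$x1 + x$x2 + rnorm(n, sd = 0.5) > 0, ">50K", "<=50K"),
            levels = c("<=50K", ">50K"))

n_folds <- 5
fold_id <- sample(rep(seq_len(n_folds), length.out = n))  # random fold per row
k_grid  <- c(1, 5, 15, 25)                                # candidate k values

# mean held-out accuracy over the folds, for each candidate k
cv_accuracy <- sapply(k_grid, function(k) {
  fold_acc <- sapply(seq_len(n_folds), function(f) {
    held_out <- fold_id == f
    pred <- knn(train = x[!held_out, ], test = x[held_out, ],
                cl = y[!held_out], k = k)
    mean(pred == y[held_out])
  })
  mean(fold_acc)
})

cv_result <- data.frame(k = k_grid, accuracy = cv_accuracy)
cv_result
k_best <- cv_result$k[which.max(cv_result$accuracy)]
```

The k chosen this way depends only on the training data, so the test set stays untouched until the final model is assessed.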

To gain a better understanding regarding which variables are the best predictors of Income class, I would also carry out a logistic regression on various uncorrelated predictors.
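A logistic regression of this kind could be sketched with stats::glm as follows. The data are simulated and the coefficients used to generate them are arbitrary assumptions, so the output says nothing about the real census data. With a factor outcome, glm models the probability of the second level, here ">50K".

```r
set.seed(7)
# Hypothetical toy data; column names mirror Df_census, values are simulated
n <- 500
toy <- data.frame(
  Age       = round(runif(n, 18, 70)),
  EducYears = sample(5:16, n, replace = TRUE),
  Sex       = rbinom(n, 1, 0.5)
)
# arbitrary "true" effects, used only to generate an outcome
lin <- -8 + 0.05 * toy$Age + 0.4 * toy$EducYears + 0.5 * toy$Sex
toy$Income <- factor(ifelse(runif(n) < plogis(lin), ">50K", "<=50K"),
                     levels = c("<=50K", ">50K"))

# logistic regression: log-odds of ">50K" as a linear function of the predictors
fit <- glm(Income ~ Age + EducYears + Sex, data = toy, family = binomial)

summary(fit)$coefficients  # estimates, standard errors, z and p values
exp(coef(fit))             # odds ratios per one-unit increase
```

On your data you would replace toy with Df_train and extend the formula with the additional predictors you want to compare.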