Tags: r, loops, dataframe, knn

R: for loop while subsetting a dataframe


I am performing text classification. I have created the features, and I have multiple labels to train on and predict; each label is a binary variable.

Here is my code and the error log.

for (col in colnames(train_data)){
  train_label <- train_data[,c(col)]
  test_pred <- knn(train = train_mat[ ,!(colnames(train_mat) == "Sentiment")], test = test_mat[ ,!(colnames(test_mat) == "Sentiment")], cl = as.factor(train_label), k=6)

  table(test_pred,test_data[, col])
  acc.RF = mean(test_pred==test_data[, col])
  acc.RF
  confusionMatrix(table(test_pred,test_data[, col]))
}

Error in knn(train = train_mat[, !(colnames(train_mat) == "Sentiment")],  : 
  'train' and 'class' have different lengths
  1. train/test_data = the original data frames, which contain the original target variables.
  2. train/test_mat = the TF-IDF features.

The error above is what I am getting.

Sentiment is the main variable to predict, but I want to train on every variable present in the original train/test data frames.

Please note that I have appended the Sentiment column to train/test_mat, so I exclude it when feeding the features to KNN.


Solution

  • Consider Map, a wrapper around mapply, to build a list of confusion matrices by passing each column of the test and train data elementwise. Also, consider transform for removing Sentiment:

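    library(class)   # knn()
    library(caret)   # confusionMatrix()

    # Fit KNN for one label column and return its confusion matrix.
    matrix_process <- function(test_label, train_label) {

      # Drop the appended Sentiment column so only the features are used.
      test_pred <- knn(train = transform(train_mat, Sentiment = NULL),
                       test = transform(test_mat, Sentiment = NULL),
                       cl = as.factor(train_label), k = 6)

      # print() is needed for output to show up inside a function.
      print(table(test_pred, test_label))
      acc.RF <- mean(test_pred == test_label)
      print(acc.RF)

      return(confusionMatrix(table(test_pred, test_label)))
    }

    # Columns are paired by position: test_data[[i]] with train_data[[i]].
    conf_matrix_list <- Map(matrix_process, test_data, train_data)

    # EQUIVALENTLY:
    conf_matrix_list <- mapply(matrix_process, test_data, train_data, SIMPLIFY=FALSE)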
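The Map approach above can be tried on toy data with the minimal sketch below. The data, the column names other than Sentiment, the helper name accuracy_for_label, and k = 3 are all made up for illustration, and caret is left out so that only the class package is needed. The sketch also checks row counts up front, because knn stops with "'train' and 'class' have different lengths" whenever nrow(train) differs from length(cl). Whether that mismatch is the cause in the original data is an assumption worth verifying.

```r
library(class)  # provides knn()

set.seed(42)

# Hypothetical toy features: 20 training rows, 10 test rows, 3 numeric columns.
train_mat <- data.frame(f1 = rnorm(20), f2 = rnorm(20), f3 = rnorm(20))
test_mat  <- data.frame(f1 = rnorm(10), f2 = rnorm(10), f3 = rnorm(10))

# Hypothetical binary label columns, one row per observation.
train_data <- data.frame(Sentiment = rep(c(0, 1), 10),
                         Spam      = rep(c(0, 0, 1, 1), 5))
test_data  <- data.frame(Sentiment = rep(c(0, 1), 5),
                         Spam      = rep(c(1, 0), 5))

# knn() requires nrow(train) == length(cl), so verify that the feature
# matrices and the label data frames are aligned before looping.
stopifnot(nrow(train_mat) == nrow(train_data),
          nrow(test_mat)  == nrow(test_data))

# Hypothetical helper: fit KNN for one label column and return its accuracy.
accuracy_for_label <- function(test_label, train_label) {
  test_pred <- knn(train = train_mat, test = test_mat,
                   cl = as.factor(train_label), k = 3)
  mean(as.character(test_pred) == as.character(test_label))
}

# Map() pairs the columns by position and keeps the names of test_data.
acc_list <- Map(accuracy_for_label, test_data, train_data)
print(acc_list)
```

Because Map pairs columns by position, test_data and train_data must list the label columns in the same order. If the row-count check fails on real data, the feature matrices and the label data frames were probably built from different subsets.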