r prediction

Mismatch between the number of test predictions and the number of test outcomes


I have trained a model on a sample size of 2120x10. Now I'm trying to apply the same model to the test data set and I'm having trouble deriving the confusion matrix.

    test_predictions <- predict(train_obj, test_data)
    test_predictions <- ifelse(test_predictions > 5, 1, 0)
    confusionMatrix(as.factor(test_predictions), test_data$outcome, positive="1")

I get an error when calculating the confusion matrix as test_data$outcome has 2135 values. If I use test_data$outcome[1:2120], everything works fine.

Is there a better way to calculate the confusion matrix without restricting the number of values? Is it correct to restrict the number of values in test_data$outcome?


Solution

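  • That doesn't sound right. How can test_data$outcome have 2135 values when predict() returned only 2120 predictions for test_data? Even if there are NAs in test_data's predictors, predict() on an lm fit returns NA for those rows by default, and confusionMatrix then ignores them. In this example, 3 of the 250 test rows have missing predictors, and the matrix is built from the remaining 247: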

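    library(dplyr)   # %>% and mutate
    library(caret)   # confusionMatrix

    dat <- data.frame(a=rnorm(1000), b=rnorm(1000))
    dat <- dat %>%
      mutate(c=5*(a+b)) %>%
      mutate(d=ifelse(c>5, 1, 0))
    set.seed(1)
    i <- sample(1:1000, 750, replace=FALSE)
    train_data <- dat[i,]
    test_data <- dat[-i,]
    # blank out the predictors in 3 test rows
    test_data[sample(1:250, 3), 1:2] <- NA
    lr <- lm(c ~ a + b, data=train_data)
    # default for lm: rows with NA predictors get an NA prediction
    test_predictions <- predict(lr, test_data)
    test_predictions <- ifelse(test_predictions>5, 1, 0)
    # confusionMatrix expects factors with the same levels
    confusionMatrix(factor(test_predictions, levels=c(0, 1)),
                    factor(test_data$d, levels=c(0, 1)))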
    
              Reference
    Prediction   0   1
             0 187   0
             1   0  60
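
    One way the lengths could still end up different is if predict() drops incomplete rows instead of returning NA for them. As far as I know, predict() on a caret train object defaults to na.action = na.omit, so if train_obj is one (an assumption, since the question doesn't say), 15 test rows with missing predictors would give exactly 2120 predictions against 2135 outcomes. Below is a rough base-R sketch of that effect and two ways to keep predictions and outcomes aligned. It uses lm with na.action set explicitly to imitate the behaviour, and the data are made up for illustration.

```r
# Made-up data: 250 test rows, 3 of them with missing predictors.
set.seed(1)
dat <- data.frame(a = rnorm(1000), b = rnorm(1000))
dat$c <- 5 * (dat$a + dat$b)
dat$d <- ifelse(dat$c > 5, 1, 0)
i <- sample(1:1000, 750)
train_data <- dat[i, ]
test_data <- dat[-i, ]
test_data[sample(1:250, 3), c("a", "b")] <- NA

lr <- lm(c ~ a + b, data = train_data)

# With na.omit, rows with missing predictors are dropped from the result,
# so the predictions are shorter than the outcome column.
short_pred <- predict(lr, test_data, na.action = na.omit)
length(short_pred)    # 247
length(test_data$d)   # 250

# Option 1: compare against only the rows that were actually predicted.
keep <- complete.cases(test_data[, c("a", "b")])
tab1 <- table(Prediction = ifelse(short_pred > 5, 1, 0),
              Reference = test_data$d[keep])

# Option 2: ask for NA predictions so both vectors keep the full length;
# table() drops the NA pairs by default.
full_pred <- predict(lr, test_data, na.action = na.pass)
tab2 <- table(Prediction = ifelse(full_pred > 5, 1, 0),
              Reference = test_data$d)
```

    Either option matches each prediction to its own row. Taking test_data$outcome[1:2120] only lines up by position, so if the dropped rows are not the last 15, predictions get compared against the wrong outcomes.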