Tags: r, prediction, na, r-caret, gbm

Predicting new data with NAs with GBM in R


I have some data in which NAs arise through a non-random process. Typically this is because users did not enter data manually or because of systematic issues with various automated jobs. GBM models appeal to me here because they handle NA values explicitly rather than requiring imputation. However, I'm having trouble getting GBM to output predictions for rows containing NA in my test sets. Here's a working example with the iris data:

library(missForest)
library(caret)

set.seed(1)
iris.na <- prodNA(iris, noNA = 0.1)
iris.na$Species <- ifelse(iris.na$Species == "setosa", "setosa", "other")
iris.na$Species <- as.factor(iris.na$Species)

set.seed(1)
train.idx <- createDataPartition(y = iris.na$Species, p = .90, list = FALSE)
train <- iris.na[ train.idx,]
test <- iris.na[ -train.idx,]
rm(train.idx)

fitControl <- trainControl(method = "cv", number = 5)
#fitControl <- trainControl(method = "oob")
fit <- train(Species ~ ., data = train, method = "gbm",
         trControl = fitControl,
         verbose = FALSE)

In-sample predictions work as I would expect for GBM: I receive one prediction for each row.

train.pred <- predict(fit, type="prob")
nrow(train)
#[1] 136
nrow(train.pred)
#[1] 136

However, predicting on the out-of-sample test data does not return one prediction per row. As you can see below, rows containing NA do not get a prediction.

test.pred <- predict(fit, newdata = test, type="prob")
nrow(test)
#[1] 14
nrow(test.pred)
#[1] 10
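
A quick way to check whether the missing predictions line up with rows that have NA predictors is to count complete cases over the predictor columns. The sketch below uses a made-up data frame (toy and its values are illustrative only, not the data from this question) so that it runs on its own in base R:

```r
# Hypothetical stand-in for a test set; the values are made up for illustration
toy <- data.frame(
  Sepal.Length = c(5.1, NA, 6.3, 5.8),
  Petal.Width  = c(0.2, 1.3, NA, 1.9),
  Species      = factor(c("setosa", "other", "other", NA))
)

# Only predictor columns matter at prediction time, so leave out the response
predictors <- setdiff(names(toy), "Species")

# TRUE for rows whose predictors are all observed
ok <- complete.cases(toy[, predictors])

n_complete <- sum(ok)     # rows that would survive na.omit: 2
dropped    <- which(!ok)  # row positions that would be dropped: 2 and 3
```

If the same count on the real test set matches the number of predictions returned, that points to NA rows being omitted rather than to a problem with the model itself.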

So it seems that rows with NAs are being dropped when predicting on new data. Ideally, I'd like one prediction per row for both the training and test sets, but I'm at a loss as to why GBM would do this for the training set and not for the test set. Thanks for any help.


Solution

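  • By default, predict.train removes rows containing NAs, because its na.action argument defaults to na.omit. You can see this by looking at the function (type predict.train in the console). Also note that na.action is applied only when newdata is supplied (the !is.null(newdata) check on line 16 of the function), not to the training data, which is why the in-sample predictions return every row.

    So the solution is to pass na.action = NULL to predict: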

    test.pred <- predict(fit, newdata = test, type = "prob", na.action = NULL)
    nrow(test)
    #[1] 14
    nrow(test.pred)
    #[1] 14
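
The effect of na.action can be seen in isolation with base R's model.frame. This is only a sketch of the mechanism: it assumes predict.train hands its na.action argument on to model.frame for formula-based fits, as the function source suggests, and the data frame df is made up for illustration.

```r
# Toy data with a missing predictor value in row 2 (illustrative values only)
df <- data.frame(y = c(1, 2, 3, 4), x = c(0.5, NA, 1.5, 2.0))

# na.omit drops every row that contains an NA
mf_omit <- model.frame(y ~ x, data = df, na.action = na.omit)

# na.pass and NULL both leave the rows, and their NAs, in place
mf_pass <- model.frame(y ~ x, data = df, na.action = na.pass)
mf_null <- model.frame(y ~ x, data = df, na.action = NULL)

nrow(mf_omit)  # 3
nrow(mf_pass)  # 4
nrow(mf_null)  # 4
```

Since na.pass keeps the rows just as NULL does in this sketch, na.action = na.pass is a reasonable alternative to try if you prefer to state the intent explicitly.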