
R caret: predict returns fewer predictions than input rows


I used caret to train the rpart model below.

library(caret)

# Hold out 20% of the rows as a test set
trainIndex <- createDataPartition(d$Happiness, p = .8, list = FALSE)
dtrain <- d[trainIndex, ]
dtest <- d[-trainIndex, ]
# 10-fold CV, repeated 10 times
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitRpart <- train(Happiness ~ ., data = dtrain, method = "rpart",
                  trControl = fitControl)
testRpart <- predict(fitRpart, newdata = dtest)

dtest contains 1296 observations, so I expected testRpart to produce a vector of length 1296. Instead it's 1077 long, i.e. 219 short.

When I ran the prediction on only the first 220 rows of dtest, I got a single prediction back, so the output is consistently 219 short.

Is there an explanation for this, and what can I do to get one prediction for every input row?

Edit: d can be loaded from here to reproduce the above.


Solution

  • I downloaded your data and found what explains the discrepancy.

    If you simply remove the rows with missing values from your test set, the lengths of the input and the output match:

    testRpart <- predict(fitRpart, newdata = na.omit(dtest))
    

    Note that nrow(na.omit(dtest)) and length(testRpart) are both 1103. So you need a strategy to address missing values. See ?predict.rpart and the options for the na.action parameter to choose what you want.
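
To make the options concrete, here is a minimal sketch on a stand-in dataset. Everything in it is an assumption for illustration: it uses iris rather than the asker's d, injects three missing values by hand, and uses the names fit, p_default, p_pass and p_padded, none of which come from the question. It relies on predict for a caret train object accepting an na.action argument that, as far as I know, defaults to na.omit, which would explain the silently dropped rows.

```r
library(caret)

set.seed(42)
idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
dtrain <- iris[idx, ]
dtest <- iris[-idx, ]

# Inject missing predictor values into three test rows (illustration only)
dtest$Sepal.Length[c(2, 5, 9)] <- NA

fit <- train(Species ~ ., data = dtrain, method = "rpart",
             trControl = trainControl(method = "cv", number = 5))

# How many test rows contain at least one NA?
sum(!complete.cases(dtest))

# Default behaviour: rows with missing predictors are dropped
p_default <- predict(fit, newdata = dtest)

# Option 1: pass the NAs through and let rpart deal with them
p_pass <- predict(fit, newdata = dtest, na.action = na.pass)

# Option 2: predict on complete rows only, then pad back to the input length
keep <- complete.cases(dtest)
p_padded <- factor(rep(NA, nrow(dtest)), levels = levels(dtrain$Species))
p_padded[keep] <- predict(fit, newdata = dtest[keep, ])

c(input = nrow(dtest), default = length(p_default),
  pass = length(p_pass), padded = length(p_padded))
```

Option 1 returns a prediction for every row, including the incomplete ones. Option 2 keeps the output aligned with the input but leaves NA where a row was incomplete, which is the safer choice if you would rather not predict from partial data.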