I used caret to train an rpart model below.
library(caret)

# Hold out 20% of the rows for testing
trainIndex <- createDataPartition(d$Happiness, p = .8, list = FALSE)
dtrain <- d[trainIndex, ]
dtest <- d[-trainIndex, ]

# 10-fold CV, repeated 10 times
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 10)
fitRpart <- train(Happiness ~ ., data = dtrain, method = "rpart",
                  trControl = fitControl)
testRpart <- predict(fitRpart, newdata = dtest)
dtest contains 1296 observations, so I expected testRpart to be a vector of length 1296. Instead it's 1077 long, i.e. 219 short.
When I ran the prediction on the first 220 rows of dtest, I got a single predicted value, so it's consistently 219 short.
Why does this happen, and what can I do to get one prediction per input row?
Edit: d can be loaded from here to reproduce the above.
I downloaded your data and found what explains the discrepancy.
If you simply remove the rows with missing values from the test set, the output length matches the input:
testRpart <- predict(fitRpart, newdata = na.omit(dtest))
Note that nrow(na.omit(dtest)) is 1103, and length(testRpart) is 1103. So you need a strategy to address missing values. See ?predict.rpart and the options for the na.action parameter to choose what you want.
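As a rough illustration of those options, here is a minimal sketch that uses rpart directly rather than caret. The built-in airquality dataset stands in for dtest (an assumption, since it is not the original data), and the names pred_omit, pred_pass, pred_aligned and ok are made up for this example. It compares dropping incomplete rows, passing them through to predict.rpart, and writing predictions back into an NA-padded vector so that positions still line up with the input rows.

```r
library(rpart)  # ships with standard R installations

# airquality is a built-in dataset with missing values in Ozone and Solar.R;
# it stands in here for dtest, which is not reproduced
fit <- rpart(Temp ~ Ozone + Solar.R + Wind, data = airquality)

# Option 1: drop incomplete rows first; the prediction vector is shorter
pred_omit <- predict(fit, newdata = na.omit(airquality))

# Option 2: pass incomplete rows through (na.pass is the default for
# predict.rpart), so every input row gets a prediction
pred_pass <- predict(fit, newdata = airquality, na.action = na.pass)

# Option 3: predict on complete rows only, but keep positions aligned
# with the input by pre-filling the result with NA
ok <- complete.cases(airquality)
pred_aligned <- rep(NA_real_, nrow(airquality))
pred_aligned[ok] <- predict(fit, newdata = airquality[ok, ])

# Compare the lengths of the three results with the number of input rows
c(rows = nrow(airquality), omit = length(pred_omit),
  pass = length(pred_pass), aligned = length(pred_aligned))
```

With caret, the predict method for train objects also takes an na.action argument, so predict(fitRpart, newdata = dtest, na.action = na.pass) may be worth trying. That call is untested against this dataset.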