I have a matrix of features (in columns) where the last column is a class label. Observations are in rows.
I use rpart in R to build a decision tree over a subset of my data and test it with predict using the rest of the data. The code to learn the tree is
fTree <- rpart(feature$a ~ feature$m, data = feature[fold != k, ],
method = "class", parms = list(split = "gini"))
The code to test it is
predFeature <- predict(fTree, newdata = feature[fold == k, ],
type = "class")
where k is an integer that I use to select a subset of the data, while fold is a matrix I use to create different subsets.
I get a warning message that some of you will probably recognize:
'newdata' had 306 rows but variables found have 3063 rows.
I read a related post but could not understand the reason, so further help is appreciated. Thanks in advance.
It is hard to say for sure because your example is not reproducible, but I am fairly certain that the problem is the following. You fitted your tree with
rpart(feature$a ~ feature$m, data = feature[fold != k, ], ...)
Thus, the dependent variable is always feature$a (and the regressor always feature$m) from the full feature data set (which apparently has 3063 observations) and not from the subset feature[fold != k, ]. This works without error but is not the tree you wanted to fit. Consequently, predict complains: newdata has only 306 observations, but these are not used, because the variables hard-coded in the formula still refer to the full data set.
Using
rpart(a ~ m, data = feature[fold != k, ], ...)
is easier to read, less to type, and should fix the problems you observe.
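To see the difference for yourself, here is a small sketch on simulated data. The data frame, the seed, and the sizes are made up for illustration, and fold is a plain vector here rather than a matrix, so adapt the subsetting to your own setup.

```r
library(rpart)

set.seed(1)
n <- 300
# Hypothetical stand-in for the real data: one predictor m and a class label a
feature <- data.frame(m = rnorm(n))
feature$a <- factor(ifelse(feature$m + rnorm(n, sd = 0.5) > 0, "yes", "no"))
fold <- sample(rep(1:10, length.out = n))  # assumption: fold is a vector of fold ids
k <- 1

train <- feature[fold != k, ]
test  <- feature[fold == k, ]

# Hard-coded feature$a and feature$m bypass `data`, so all n rows are used
bad <- rpart(feature$a ~ feature$m, data = train,
             method = "class", parms = list(split = "gini"))
bad$frame$n[1]    # observations in the root node: the full data set

# Bare column names are looked up in `data`, so only the training rows are used
good <- rpart(a ~ m, data = train,
              method = "class", parms = list(split = "gini"))
good$frame$n[1]   # observations in the root node: the training subset

# With bare names, predict() evaluates m inside newdata
good_pred <- predict(good, newdata = test, type = "class")
length(good_pred) == nrow(test)
```

Calling predict(bad, newdata = test, type = "class") on the first tree should reproduce the warning from the question, since feature$m is again taken from the full data set instead of newdata.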