cannot predict from rpart


I have a matrix of features (in columns) where the last column is a class label. Observations are in rows.

I use rpart in R to build a decision tree over a subset of my data and test it with predict using the rest of the data. The code to learn the tree is

fTree <- rpart(feature$a ~ feature$m, data = feature[fold != k, ],
  method = "class", parms = list(split = "gini"))

The code to test it is

predFeature <- predict(fTree, newdata = feature[fold == k, ],
  type = "class")

where k is an integer that I use to select a subset of the data, while fold is a matrix I use to create different subsets.

I get a warning message that some of you will probably recognize:

'newdata' had 306 rows but variables found have 3063 rows.

I read a related post but failed to understand the reason, so any further help is appreciated. Thanks in advance.


Solution

  • It is hard to say for sure because your example is not reproducible, but I am fairly certain that the problem is the following. You fitted your tree with

    rpart(feature$a ~ feature$m, data = feature[fold != k, ], ...)
    

    Thus, the variables are always taken from the full feature data set (which apparently has 3063 observations) and not from the subset feature[fold != k, ], because feature$a and feature$m are hard-coded in the formula. This runs without error but does not fit the tree you wanted. The same thing happens in predict: newdata has only 306 observations, but they are ignored and the hard-coded variables from the full data set are used instead, which triggers the warning.

    Using

    rpart(a ~ m, data = feature[fold != k, ], ...)
    

    is easier to read, less to type, and should fix the problems you observe.
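
Since the original data are not available, here is a minimal sketch of both variants on a stand-in data set. It assumes the built-in iris data in place of feature, Species and Petal.Length in place of a and m, and a plain fold vector rather than a matrix; none of these names come from the question.

```r
library(rpart)

set.seed(1)
# Hypothetical stand-in for the asker's data: iris, with Species as the class label
feature <- iris
# Assumed fold layout: one fold id (1..5) per row, in random order
fold <- sample(rep(1:5, length.out = nrow(feature)))
k <- 1

train <- feature[fold != k, ]
test  <- feature[fold == k, ]

# Problematic variant: feature$... bypasses `data`, so the full data set is used
badTree <- rpart(feature$Species ~ feature$Petal.Length, data = train,
                 method = "class", parms = list(split = "gini"))
# Expected to warn that the rows of newdata do not match the variables found
badPred <- predict(badTree, newdata = test, type = "class")

# Fixed variant: bare column names are looked up in `data` and in `newdata`
fTree <- rpart(Species ~ Petal.Length, data = train,
               method = "class", parms = list(split = "gini"))
predFeature <- predict(fTree, newdata = test, type = "class")

# Compare prediction lengths with the number of held-out rows
c(bad = length(badPred), fixed = length(predFeature), heldOut = nrow(test))

# Confusion table for the held-out fold
table(predicted = predFeature, actual = test$Species)
```

With the fixed formula, predict returns one class per row of the held-out fold, so the result can be compared directly with the true labels.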